Method and computer program product for reducing fluorophore-specific bias

ABSTRACT

A method for fluorophore bias removal in microarray experiments in which the fluorophores used in microarray experiment pairs are reversed. Further, a method for calculating the individual errors associated with each measurement made in nominally repeated microarray experiments. This error measurement is optionally coupled with rank based methods in order to determine a probability that a cellular constituent is up or down regulated in response to a perturbation. Finally, a method for determining the confidence in the weighted average of the expression level of a cellular constituent in nominally repeated microarray experiments.

CROSS REFERENCE TO RELATED APPLICATION

This application is a continuation of application Ser. No. 10/058,696,filed Jan. 28, 2002, now abandoned which is a division of applicationSer. No. 09/222,596, filed Dec. 28, 1998, now U.S. Pat. No. 6,351,712B1, each of which is incorporated by reference herein in its entirety.

1 FIELD OF THE INVENTION

The field of this invention relates to methods for using data frommultiple repeated experiments to generate a confidence value for eachdata point, increase sensitivity, and eliminate systematic experimentalbias.

2 BACKGROUND OF THE INVENTION

2.1 Quantitative Measurement of Cellular Constituents

There is currently an explosive increase in the generation ofquantitative measurements of the levels of “cellular constituents”.Cellular constituents include gene expression levels, abundance of mRNAencoding specific genes, and protein expression levels in a biologicalsystem. Levels of various constituents of a cell, such as mRNA encodinggenes and/or protein expression levels, are known to change in responseto drug treatments and other perturbations of the cell's biologicalstate. Measurements of a plurality of such “cellular constituents”therefore contain a wealth of information about the affect ofperturbations on the cell's biological state. The collection of suchmeasurements is generally referred to as the “profile” of the cell'sbiological state.

There may be on the order of 100,000 different cellular constituents formammalian cells. Consequently, the profile of a particular cell istypically complex. The profile of any given state of a biological systemis often measured after the biological system has been subjected to aperturbation. Such perturbations include experimental or environmentalconditions(s) associated with a biological system such as exposure ofthe system to a drug candidate, the introduction of an exogenous gene,the deletion of a gene from the system, or changes in cultureconditions. Comprehensive measurements of cellular constituents, orprofiles of gene and protein expression and their response toperturbations in the cell, therefore have a wide range of utilityincluding the ability to compare and understand the effects of drugs,diagnose disease, and optimize patient drug regimens. In addition, theyhave further application in basic life science research.

Within the past decade, several technological advances have made itpossible to accurately measure cellular constituents and thereforederive profiles. For example, new techniques provide the ability tomonitor the expression level of a large number of transcripts at any onetime (see, e.g., Schena et al., 1995, Quantitative monitoring of geneexpression patterns with a complementary DNA micro-array, Science 270:467-470; Lockhart et al., 1996, Expression monitoring by hybridizationto high-density oligonucleotide arrays, Nature Biotechnology 14:1675-1680; Blanchard et al., 1996, Sequence to array: Probing thegenome's secrets, Nature Biotechnology 14, 1649; U.S. Pat. No.5,569,588, issued Oct. 29, 1996 to Ashby et al. entitled “Methods forDrug Screening”). In organisms for which the complete genome is known,it is possible to analyze the transcripts of all genes within the cell.With other organisms, such as humans, for which there is an increasingknowledge of the genome, it is possible to simultaneously monitor largenumbers of the genes within the cell.

In another front, the direct measurement of protein abundance has beenimproved by the use of microcolumm reversed-phase liquid chromatographyelectrospray ionization tandem mass spectrometry (LC/MS/MS) to directlyidentify proteins contained in mixtures. This technology promises topush the dynamic range for which protein abundance can be measured in abiological system. Using LC/MS/MS, McCormack et al. have demonstratedthat proteins presented in system mixtures can be readily identifiedwith a 30-fold difference in molar quantity, that the identificationsare reproducible, and that proteins within the mixture can be identifiedat low femtomole levels. McCormack et al., 1997, Direct analysis andidentification of proteins in mixtures by LC/MS/MS and databasesearching at the low-femtomole level, Anal. Chem. 69: 767-776. In areview of tandem mass spectrometry, Chait points out that an additionaladvantage of this technology is that it is orders of magnitude fasterthan more conventional approaches such as Edman sequencing. Chait, 1996,Trawling for proteins in the post-genome era, Nat. Biotech. 14: 1544.

Other technological advances have provided for the ability tospecifically perturb biological systems with individual geneticmutations. For example, Mortensen et al. describe a method for producingembryonic stem (ES) cell lines whereby both alleles are inactivated byhomologous recombination. Using the methods of Mortensen et al., it ispossible to obtain homozygous mutationally altered cells, i.e., doubleknockouts of ES cell lines. Mortensen et al. propose that their methodmay be generally applicable to other genes and to cell lines other thanES cells. M Mortensen et al. 1992, Production of homozygous mutant EScells with a single targeting construct, Cell Biol. 12: 2391-2395.

In another promising technology Wach et al. provide a dominantresistance module for selection of S. cerevisiae transformants whichentirely consists of heterologous DNA. The module can also be used toprovide PCR based gene disruptions. Wach et al., 1994, New heterologousmodules for classical or PCR-based gene disruptions in Saccharomycescerevisiae, Yeast 10: 1793-808.

Technological advances, such as the use of microarrays, are alreadybeing used in rug discovery (See e.g. Marton et al., 1998, Drug targetvalidation and identification of secondary drug target effects usingMicroarrays, Nature Medicine in press; Gray et al., 1998, Exploitingchemical libraries, structure, and genomics in the search for kinaseinhibitors, Science 281: 533-538).

Comparison of profiles with other profiles in a database (see, e.g.,U.S. Pat. No. 5,777,888, issued Jul. 7, 1998 to Rine et al. entitled“Systems for generating and analyzing stimulus-response output signalmatrices”) or clustering of profiles by similarity can give clues to themolecular targets of drugs and related functions, efficacy and toxicityof drug candidates and/or pharmacological agents. Such comparisons mayalso be used to derive consensus profiles representative of ideal drugactivities or disease states. Profile comparison can also help detectdiseases in a patient at an early stage and provide improved clinicaloutcome projections for a patient diagnosed with a disease.

2.2 Fluorophore Bias

The use of two fluorophores has been described by Shalon et al., 1996,“A microarray system for analyzing complex DNA samples using two-colorfluorescent probe hybridization,” Genome Research 6:629-645. The problemwith the approach put forth by Shalon is that each species of mRNAmolecule has a bias in its measured color ratio due to interaction ofthe fluorescent labeling molecule with either the reverse transcriptionof the mRNA or with the hybridization efficiency or both. Without anyerror correction scheme to account for this bias, the data from a singlemicroarray experiment, or even a plurality of nominal repeats of amicroarray experiment in which the various results are averages, willproduce an unacceptable error rate. As used herein, the term nominalrepeat or nominally repeated experiment refers to experiments that arerun under essentially the same or similar experimental conditions suchthat it would be useful to combine the results of the repeatedexperiments.

2.3 Inherent Error Rates of Cellular Constituent QuantitativeMeasurement Experiments

While the technological advances have allowed for the generation ofquantitative measurements of the levels cellular constituents, theexperiments are expensive. A single microarray experiment, or a singlegel electrophoresis place, can cost in the neighborhood of $100-$ 1000and higher. Also, it has only become apparent after many initialattempts to apply the data to actual commercial needs that individualexperiments suffer from high levels of false positives in the sense ofdeclaring significance where there really is none. Because of theexpense involved, and the high rate of false positives, no descriptionof robust methods for repeating and statistically combining multiple,nominally identical experiments for the express purpose of data qualityimprovement have been provided in the prior art.

The power of genome-wide cell profiling accomplished with microarrays isin its ability to survey response to known perturbations acrossessentially the entire set of cellular mechanisms. However, in any givenexperiment, typically only a small number of cellular constituents mayhave dramatic changes in abundance, where the vast majority areunchanged. There are exceptions, but cells have specific, biologicallyfairly insulated responses to stimuli, and so most profiles involve alarge set of constituents with ‘no-change’, and a much smaller set thatare either up or down regulated. For this reason, even a small falsealarm rate in the measurements can severely compromise their utility.For example, if one percent of cellular constituents actually respond ina typical experiment, the resolution in the measurement is twofold, andthe errors exceed twofold one percent of the time, then there will be asmany false alarms as true detections above a twofold threshold.

In general, the art has underappreciated the extensive amount of errorsthat are present in individual cellular constituent quantificationexperiments such as microarray or protein gel experiments. In additionto the difficulty posed by the fairly insulated response biologicalsystems have to any given perturbation, a substantial amount of error ispresent in any nominal microarray experiment due to artifacts such asunevenly printed DNA probe spots on the microarray, scratches dust andartifacts on the microarray, uneveness in signal brightness across themicroarray due to nonuniform DNA hybridization, and color stripes duefluorophore-specific biases of fluorophores used in the microarrayprocess.

One method to reduce the effects of these serious errors is to repeatthe experiment under identical conditions and to average the data.However, simple averaging of the data without any consideration of thenature of the underlying experimental errors does not provide anadequate solution to the problems the experimental errors introduce. Ifonly simple averaging of the data is performed, an excessive number ofnominal repeats would be required in order to reduce the effects oferror down to an acceptable level. However, because of the expenseinvolved in performing each cellular constituent quantificationexperiment, this is not a feasible solution. Accordingly, what is neededin the art are robust methods for combining the experimental results ofrepeated cellular constituent quantification experiments so that aminimal set of nominal repeats can provide an acceptable error rate.

Discussion of citation of a reference herein shall not be construed asan admission that such citation is prior art to the present invention.

3 SUMMARY OF THE INVENTION

This invention provides solutions for minimizing the number of times acellular constituent quantification experiment must be repeated in orderto produce data that has acceptable error levels. Accordingly, themethods of the present invention provide a novel method for fluorophorebias removal. This allows for the attenuation of fluorophore specificbiases to acceptable levels based on only two nominal repeats of acellular constituent quantification experiment. The present inventionfurther provides methods for combining nomimal repeats of a cellularconstituent quantification experiment based on rank order ofup-regulation or down-regulation. In these methods, cellular constituentup- or down-regulation data determined from nominal repeats of cellularconstituent quantification experiments are expressed by a novel metricthat is free of intensity dependent errors. Application of this metricbefore combining based on rank order provides a powerful method forremoving error from weakly expressing cellular constituents without anexcessive number of nominal repetitions of the expensive cellularconstituent quantification experiment.

Another aspect of the present invention is an improved method forcomputing a weighted average of individual cellular constituentmeasurements in nominally repeated cellular constituent quantificationexperiments. In particular, a novel method for calculating the errorassociated with each cellular constituent measurement is provided. Byusing this novel method for calculating error, the error bar in theweighted average is sharply attenuated. One skilled in the art willappreciate that these improved methods for computing a weighted averageare applicable to two-fluorophore (two-color) or single fluorophore(one-color) protocols.

One embodiment of the present invention provides a method of fluorophorebias removal comprising the steps of:

-   (a) labeling a first pool of genetic matter, derived from a    biological system representing a baseline state, with a first    fluorophore to obtain a first pool of fluorophore-labeled genetic    matter;-   (b) labeling a second pool of genetic matter, derived from a    biological system representing a perturbed state, with a second    fluorophore to obtain a second pool of fluorophore-labeled genetic    matter;-   (c) labeling a third pool of genetic matter, derived from said    biological system representing said baseline state, with said second    fluorophore to obtain a third pool of fluorophore-labeled genetic    matter;-   (d) labeling a fourth pool of genetic matter, derived from said    biological system representing said perturbed state, with said first    fluorophore to obtain a fourth pool of fluorophore-labeled genetic    matter;-   (e) independently contacting said first pool of fluorophore-labeled    genetic matter and said second pool of fluorophore-labeled genetic    matter with a first microarray under conditions such that    hybridization can occur and determining a first color ratio between    said first pool of fluorophore-labeled genetic matter and said    second pool of fluorophore-labeled genetic matter that binds to said    microarray;-   (f) independently contacting said third pool of fluorophore-labeled    genetic matter and said fourth pool of fluorophore-labeled genetic    matter with a second microarray under conditions such that    hybridization can occur and determining a second color ratio between    said third pool of fluorophore-labeled genetic matter and said    fourth pool of fluorophore-labeled genetic matter; and-   (g) computing an average color ratio by averaging said first color    ratio and said second color ratio.

Another embodiment of the invention provides a method for determining aprobability that an expression level of a cellular constituent in aplurality of paired differential microarray experiments is altered by aperturbation, wherein each paired differential microarray experiment insaid plurality of paired differential microarray experiments comprises afirst microarray experiment representing a baseline state of a firstbiological system, and a second microarray experiment representing aperturbed state of said first biological system, said method comprisingthe steps of

-   (a) determining an error distribution statistic by fitting a    reference pair of microarray experiments with an intensity    independent statistic, wherein said reference pair of microarray    experiments comprises a first reference microarray experiment, and a    second reference microarray experiment that is a nominal repeat of    said first reference microarray experiment;-   (b) selecting said cellular constituent from a set of cellular    constituents measured in said plurality of paired differential    microarray experiments, and, for each paired differential microarray    experiment in said plurality of paired differential microarray    experiments, determining an amount of change in expression level of    said cellular constituent between the second microarray experiment    and the first microarray experiment of said paired differential    microarray experiment using said error distribution statistic; and-   (c) determining said probability that said expression level of said    cellular constituent in said plurality of paired differential    microarray experiments is altered by said perturbation by combining    said amount of change in expression level of said cellular    constituent determined in step (b) for each paired differential    microarray experiment in said plurality of paired differential    microarray experiments using a rank based method.

Yet another embodiment of the invention is a method for determining aweighted mean differential intensity in an expression level of acellular constituent in a biological system in response to aperturbation, the method comprising:

-   -   (a) determining an error distribution statistic by fitting a        reference microarray experiment pair with an intensity        independent statistic, wherein the reference microarray        experiment pair comprises a first reference microarray        experiment and a second reference microarray experiment which is        a nominal repeat of the first reference microarray experiment;    -   (b) determining an amount of differential expression of the        cellular constituent a plurality of times;    -   (c) for each amount of differential expression determined in        accordance with (b), calculating a corresponding amount of error        based on a magnitude derived by the error distribution        statistic; and    -   (d) computing the weighted mean differential intensity by        inversely weighting each amount of the differential expression        of the cellular constituent determined in step (b) by the        corresponding amount of error determined in step (c) according        to the formula

$x = \frac{\sum\left( {x_{i}/\sigma_{i}^{2}} \right)}{\sum\left( {1/\sigma_{i}^{2}} \right)}$where x is the weighted mean differential intensity of the cellularconstituent, x_(i) is a differential expression measurement of thecellular constituent i and σ_(i) ² is a corresponding error for meandifferential intensity x_(i).

4 BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 depicts some sources of measurement error present in microarrayfluorescent images. Panel (a) depicts unevenly printed DNA probe spots.Panel (b) depicts the effects of scratches, dust, and artifacts. Panel(c) depicts how spot positions drift away from a nominal measuring grid.Panel (d) depicts the effects of unevenness in the brightness across themicroarray due to uneven hybridization. Panel (e) depicts the effects ofcolor stripes on the microarray due to fluorophore-specific biases.

FIG. 2 illustrates the effect of deleting genes responsible for theproduction of calcineurin protein in the yeast S. Cerevisiae (CNA1 andCNA2). The figure contrasts the response profile of two yeast cultures,a native culture (Culture 1) and a culture in which CNA1 and CNA2 havebeen deleted (Culture 2). The horizontal axis is the log₁₀ of theintensity of the individual hybridized spots on the microarrary obtainedfrom the two yeast cultures, and therefore represents mRNA speciesabundance. The vertical axis is the log₁₀ of the ratio of the intensitymeasured for one fluorescent label (Culture 1) to that measured for theother label (Culture 2) (expression ratio). True signature genes of aCNA1/CNA2 mutation are identified as those deviating significantly fromthe log₁₀ (expression ratio)=0 line and are labeled.

FIG. 3 depicts the intensity-dependent bias that occurs in cellexpression profile experiments due to variance in fluorophore opticaldetection efficiencies as well as variance in fluorophore incorporationefficiencies.

FIG. 4 a is a color ratio vs. intensity plot for an experiment in whichboth cultures were the same background strain of the yeast S.cerevisiae. Genes with a distinct bias between a red and greenfluorophore are flagged. FIG. 4 b is the same experiment as depicted inFIG. 4 a except that usage of the red and green fluorophores isreversed. FIG. 4 c depicts the bias removal process of the invention,wherein FIG. 4 a and FIG. 4 b are combined to produce a response profilefree of fluorophore-specific biases.

FIG. 5 compares two identical response profiles that were performedunder identical experimental conditions. The figure shows thatexperimental errors decrease as a function of intensity (expressionlevel). Intensity independent contour lines illustrate a component ofthe error correction methods of the present invention.

FIG. 6 a shows a typical signature plot for a single experiment with thedrug Cyclosporin A. FIG. 6 b shows the results of applying a weightedaverage according to the methods of the present invention to fourrepeats of the experiment depicted in FIG. 6 a.

FIG. 7 illustrates a computer system useful for embodiments of theinvention.

5 DETAILED DESCRIPTION OF THE INVENTION

5.1. Introduction and general definitions

Perturbation: As used herein, a perturbation is the experimental orenvironmental condition(s) associated with a biological system.Perturbations may be achieved by exposure of a biological system to adrug candidate or pharmacologic agent, the introduction of an exogenousgene into a biological system, the deletion of a gene from thebiological system, changes in the culture conditions of the biologicalsystem, or any other art recognized method of perturbing a biologicalsystem. Further, perturbation of a biological system may be achieved bythe onset of disease in the biological system.

Genetic Matter: As used herein, the term “genetic matter” refers tonucleic acids such as messenger RNA (“mRNA”), complementary DNA(“cDNA”), genomic DNA (“gDNA”), DNA, RNA, genes, oligonucleotides, genefragments, and any combination thereof.

Fluorophore-labeled genetic matter: As used herein, the term“fluorophore-labeled genetic matter” refers to genetic matter that hasbeen labeled with a fluorescently-labeled probe (“fluorophore”).Fluorophores include, but are not limited to, fluorescein, lissamine,phycoerythrin, rhodamine (Perkin Elmer Cetus), Cy2, Cy3, Cy3.5, Cy5,Cy5.5, Cy7, FluorX (Amersham) and others (see, e.g., Kricka, 1992,Nonisotopic DNA Probe Techniques, Academic Press San Diego, Calif.).This DNA may be prepared by reverse transcription of mRNA or by(PCR/IVT) or (IVT) with use of fluorophores as those skilled in the artwill appreciate. See e.g. Gelder et al., 1990, “Amplified RNAsynthesized from limited quantities of heterogenous cDNA, Proc. Natl.Acad. Sci., USA, 87: 16.63-1667). As used herein, the term PCR refers tothe Polymerase Chain Reaction.

Biological System: As used herein, the term “biological system” isbroadly defined to include any cell, tissue, organ or multicellularorganism. For example, a biological system can be a cell line, a cellculture, a tissue sample obtained from a subject, a Homo sapien, amammal, a yeast substantially isogenic to Saccharomyces cerevisiae, orany other art recognized biological system. The state of a biologicalsystem can be measured by the content, activities or structures of itscellular constituents. The state of a biological system, as used herein,is determined by the state of a collection of cellular constituents,which are sufficient to characterize the cell or organism for anintended purpose including characterizing the effects of a drug or otherperturbation. The term “cellular constituent” encompasses any kind ofmeasurable biological variable. The measurements and/or observationsmade on the state of these constituents can be of their abundances(i.e., amounts or concentrations in a biological system), theiractivities, their states of modification (e.g., phosphorylation), orother art recognized measurements relevant to the physiological state ofa biological system. In various embodiments, this invention includesmaking such measurements and/or observations on different collections ofcellular constituents. These different collections of cellularconstituents are also called aspects of the biological state of abiological system.

One aspect of the biological state of a biological system (e.g., a cellor cell culture) usefully measured in the present invention is itstranscriptional state. The transcriptional state of a biological systemincludes the identities and abundances of the constituent RNA species,especially mRNAs, in the cell under a given set of conditions. Often, asubstantial fraction of all constituent RNA species in the biologicalsystem are measured, but at least a sufficient fraction is measured tocharacterize the action of a drug or other perturbation of interest. Thetranscriptional state of a biological system can be convenientlydetermined by measuring cDNA abundances by any of several existing geneexpression technologies. DNA arrays for measuring mRNA or transcriptlevel of a large number of genes can be employed to ascertain thebiological state of a system.

Another aspect of the biological state of a biological system usefullymeasured is its translational state. The translational state of abiological system includes the identities and abundances of theconstituent protein species in the biological system under a given setof conditions. Preferably a substantial fraction of all constituentprotein species in the biological system is measured, but at least asufficient fraction is measured to characterize the action of a drug ofinterest. The transcriptional state is often representative of thetranslational state.

Other aspects of the biological state of a biological system are also ofuse in this invention. For example, the activity state of a biologicalsystem includes the activities of the constituent protein species (andalso optionally catalytically active nucleic acid species) in thebiological system under a given set of conditions. As is known to thoseof skill in the art, the translational state is often representative ofthe activity state.

This invention is also adaptable, where relevant, to “mixed” aspects ofthe biological state of a biological system in which measurements ofdifferent aspects of the biological state of a biological system arecombined. For example, in one mixed aspect, the abundances of certainRNA species and of certain protein species, are combined withmeasurements of the activities of certain other protein species.Further, it will be appreciated from the following that this inventionis also adaptable to any other aspect of a biological state of abiological system that is measurable.

The biological state of a biological system (e.g., a cell or cellculture) can be represented by a profile of some number of cellularconstituents. Such a profile of cellular constituents can be representedby the vector S.S=[S ₁ , . . . S _(i) , . . . S _(k)]Where S_(i) is the level of the i'th cellular constituent, for example,the transcript level of gene i, or alternatively, the abundance oractivity level of protein i.

Quantitative Measurement of Cellular Constituents: MicroarraysDetermining the relative abundance of diverse individual sequences incomplex DNA samples is often accomplished using microarrays. See e.g.Shalon et al., 1996, “A Microarray System for Analyzing Complex SamplesUsing Two-color Fluorescent Probe Hybridization,” Genome Research6:639-645). Frequently, transcript arrays are produced by hybridizingdetectably labeled polynucleotides representing the mRNA transcriptspresent in a cell (e.g., fluorescently labeled cDNA synthesized fromtotal cell mRNA) to a microarray. A microarray is a surface with anordered array of binding (e.g., hybridization) sites for products ofmany of the genes in the genome of a cell or organism, preferably mostor almost all of the genes. Microarrays are highly reproducible andtherefore multiple copies of a given array can be produced and thenominal copies can be compared with each other. Preferably microarraysare small, usually smaller than 5 cm², and made from materials that arestable under binding (e.g., nucleic acid hybridization) conditions. Agiven binding site or unique set of binding sites in the microarray willspecifically bind the product of a single gene in the cell.

When cDNA complementary to the RNA of a cell is made and hybridized to amicroarray under suitable hybridization conditions, the level ofhybridization to the site in the array corresponding to any particulargene will reflect the prevalence in the cell of mRNA transcribed fromthat gene. For example, when detectably labeled (e.g., with afluorophore) cDNA complementary to the total cellular mRNA is hybridizedto a microarray, the site on the array corresponding to a gene (i.e.,capable of specifically binding the product of the gene) that is nottranscribed in the cell will have little or no signal (e.g., fluorescentsignal), and a gene for which the encoded mRNA is prevalent will have arelatively strong signal.

Microarrays are advantageous because nucleic acids representing twodifferent pools of nucleic acid can be hybridized to a microarray andthe relative signal from each pool can simultaneously be measured. Eachpool of nucleic acids may represent the state of a biological systembefore and after a perturbation. For example, a first nucleic acid poolmay be derived from a mRNA pool from a cell culture before exposing thecell culture to a pharmacological agent and a second cDNA pool may bederived from a mRNA pool derived from the same culture after exposingthe culture to a pharmacological agent. Alternatively, the two pools ofcDNA could represent pathway responses. Thus, a first cDNA library couldbe derived from the mRNA of a first aliquot (“pool”) of a cell culturethat has been exposed to a pathway perturbation and a second cDNAlibrary can be derived from the mRNA of a second aliquot (“pool”) of thesame cell culture wherein the second aliquot was not exposed to thepathway perturbation. As used herein, microarray experiments, includingthose described in this section, are referred to as (“differentialmicroarray experiments”). One skilled in the art will appreciate thatmany forms of differential microarray experiments other than the onesoutlined in this disclosure are within the scope of the definition of“differential microarray experiments”. Further, as used herein, the term“differential intensity measurement” refers to measurements made indifferential microarray experiments. For example, a differentialintensity measurement could be the difference between the brightness ofa position on a microarray, which corresponds to a cellular constituentof interest, after (i) the microarray has been contacted with DNAderived from a biological system that represents a baseline state and(ii) the microarray has been contacted with DNA derived from abiological system that represents a perturbed state. Further, oneskilled in the art will appreciate that the baseline state of abiological system may represent the wild-type state of the biologicalsystem. Alternatively, the baseline state of a biological system couldrepresent a different perturbed state of the biological system. Eachmicroarray experiment in a differential microarray experiment, orrepeated differential microarray experiment preferably utilizes the sameor similar microarray. Microarrays are considered similar if they areprepared from substantially isogenic biological systems and a majorityof the binding spots on each microarray are common. Thus, the microarrayused in repeated microarray experiments may be the same identicalmicroarray, wherein the microarray is washed between microarrayexperiments, or the microarray(s) used in repeated microarrayexperiments may be exact replicas of each other, or they may be similarto each other.

Regardless of the source of the two cDNA pools in differentialmicroarray experiments, each cDNA pool is distinctively labeled with adifferent dye if the two-fluorophore microarray format is chosen. Oneskilled in the art will appreciate that certain aspects of the presentinvention are not limited to the two-fluorophore format. Typically, eachcDNA pool is labeled by deriving fluorescently-labeled cDNA by reversetranscription of polyA⁺RNA in the presence of Cy3-(green) or Cy5-(red)deoxynucleotide triphosphates (Amersham). When the two cDNAs pools aremixed and hybridized to the microarray, the relative intensity of signalfrom each cDNA set is determined for each site on the array, and anyrelative difference in abundance of a particular mRNA detected.

When two different fluorescently labeled probes are used, such as CY3and CY5, the fluorescence emissions at each site of a microarray can bedetermined using scanning confocal laser microscopy. In one embodiment,a separate scan, using the appropriate excitation line, is carried outfor each of the two fluorophores used. Alternatively, a laser may beused that allows simultaneous specimen illumination at wavelengthsspecific to the two fluorophores and emissions from the two fluorophorescan be analyzed simultaneously (See e.g. Shalon et al., supra). Themicroarrays may be scanned with a laser fluorescent scanner with acomputer controlled X-Y stage and a microscope objective. Sequentialexcitation of the two fluorophores is achieved with a multi-line, mixedgas laser and the emitted light is split by wavelength and detected withtwo photomultiplier tubes. Fluorescence laser scanning devices aredescribed in Schena et al., 1996, Genome Res. 6: 639-645 and inreferences cited herein. Alternatively, the fiber-optic bundle describedby Ferguson et al., 1996, Nature Biotech. 14: 1681-1684, may be used tomonitor mRNA abundance levels at a large number of sites simultaneously.

Signals may be recorded and analyzed by computer, e.g., using a 12 bitanalog to digital board. The scanned image may be despeckled using agraphics program (e.g., Hijaak Graphics Suite) and then analyzed usingan image gridding program that creates a spreadsheet of the averagehybridization at each wavelength at each site. If necessary, anexperimentally determined correction for “cross talk” (or overlap)between the channels for the two fluorophores may be made.

As used herein, the term “microarray experiments” refers to the generalclass of experiments that are described in this section. One skilled inthe art will appreciate that microarray experiments may include the useof a single fluorophore rather than the two-fluorophore exampledescribed infra. Further, microarray experiments may be paired. Ifpaired, the first micro rray experiment in the pair could represent anominal biological system representing a baseline state. The secondmicroarray experiment in the pair could represent the nominal biologicalsystem after it has been subjected to a perturbation. Thus comparison ofthe paired microarray experiment would reveal changes in the state ofthe nominal biological system based upon the perturbation. Generally, asdiscussed, supra, these pairs of microarray experiments are referred toas “differential microarray experiments”.

Cell Expression Profiles An advantage of using two different cDNA poolsin microarray experiments is that a direct and internally controlledcomparison of the mRNA levels corresponding to each arrayed gene in twocell states can be made. This and related techniques for quantitativemeasurement of cellular constituents is generally referred to as cellconstituent profiling. Cell constituent profiling is typically expressedas changes, either in absolute level or the ratio of levels, between twoknown cell conditions, such as a response to treatment of a baselinestate with a pharmacological agent, as described in the previoussection.

Using the experimental procedures outlined in the preceding section, aratio of the emission of the two fluorophores may be calculated for anyparticular hybridization site on a DNA transcript array. This ratio isindependent of the absolute expression level of the cognate gene, but isuseful for genes whose expression is significantly modulated by drugadministration, gene deletion, or any other perturbation. As illustratedin FIGS. 2-6, two-fluorophore cell expression profiles are typicallyplotted on an x-y graph. The horizontal axis represents the log₁₀ of theratio of the mean intensity (which approximately reflects the level ofexpression of a corresponding mRNA derived from a gene) between thefirst and second pool of cDNA for each site on the microarray. Thevertical axis represents the log₁₀ of the ratio of the intensitymeasured for one fluorescent label, corresponding to the first pool ofcDNA, to that measured for the other fluorescent label, corresponding tothe second pool of cDNA, for each hybridization site on the microarray.

5.2 Fluorophore Bias Removal

As detailed in the background section the two-color fluorescenthybridization process put forth by Shalon et al., supra, introduces biasinto the profile analysis because each species of mRNA that is labeledwith fluorophore has a bias in its measured color ratio due tointeraction of the fluorescent labeling molecule (fluorophore) witheither the reverse transcription of the mRNA or with the hybridizationefficiency or both. This bias can be illustrated using the followingequations. If we represent the actual molecular abundance of aparticular species of mRNA k, representing cellular constituent or genek in the biological system of interest, as a(k), the color ratio forprobe k, ignoring any source of fluorophore bias may be represented as:r _(X/Y) =a ₁(k)/a ₂(k)  (1)where

the subscripts 1 and 2 refer to two independently extracted mRNAcultures in which abundances are being compared;

a₁(k) is the abundance of species k in mRNA culture 1;

a₂(k) is the abundance of species k in mRNA culture 2;

subscripts X and Y represent the two different fluorescent labels used;and

r_(X/Y) is the color ratio that ideally reflects abundance ratio a₁/a₂.

Equation (1) ideally represents the measurement plotted on the verticalaxis of FIGS. 2 thru 6. However the use of a fluorophore labeleddeoxynucleotide triphosphates affects the efficiency by which mRNA isreverse transcribed into cDNA and affects the efficiency to which thefluorophore-labeled cDNA hybridizes to the microarray. The preciseamount a specific fluorophore affects the transcription or hybridizationefficiency is highly dependent upon the precise molecular structure ofthe fluorophore used. Thus, a direct comparison of a₁(k) to a₂(k), whena₁(k) and a₂(k) are determined using different fluorophores, does notaccount for these fluorophore-specific affects on transcription andhybridization efficiency. The efficiency of a scanner at determining theabundances a₁(k) and a₂(k) on a microarray is also fluorophore specific.If we represent the combined efficiencies of particular fluorophore inextraction, labeling, reverse transcription, hybridization, and opticalscanning as E, a more realistic representation of the color ratiopresented in Equation 1 is:r _(X/Y) =a ₁(k)E _(X)(k)/a ₂(k)E _(Y)(k)  (2)where

r_(X/Y) is color ratio;

the subscripts 1 and 2 are as defined for equation 1;

a₁(k) and a₂(k) are as defined for equation 1;

subscripts X and Y are two fluorescent labels;

E_(X)(k) is the efficiency of fluorescent label X; and

E_(Y)(k) is the efficiency of fluorescent label Y.

In equation 2, Culture 1 has been analyzed using fluorophore X whereasCulture 2 has been analyzed using fluorophore Y. Now the color ratio ris related to the desired abundance ratio a₁/a₂ but includes a factordue to the fluorophore specific efficiency biases. If a secondhybridization experiment is performed, wherein Culture 1 is now analyzedwith fluorophore Y and Culture 2 is analyzed using fluorophore X, thecolor ratio in the second hybridization experiment may be representedas:r _(X/Y) ^((rev)) =a ₂(k)E _(X)(k)/a ₁(k)E _(Y)(k)  (3)where

r_(X/Y) ^((rev)) is color ratio in the reverse experiment; and

a₂(k), a₁(k), E_(X)(k), and E_(Y)(k) are as described for equation (2).

Performing hybridization experiments in pairs, with the label assignmentreversed in one member of the pair, allows for creation of a combinedaverage measurement in which the fluorophore specific bias is sharplyreduced. For example a pair of two-fluorophore hybridization experimentsmay be performed. The first two-fluorophore experiment would beperformed in accordance with equation (2) and the second two-fluorophorehybridization experiments would be performed according to equation (3).If the log of the ratio of the two experiments is taken, the combinedexperiment can be expressed as:

$\begin{matrix}{{\left( {1/2} \right)\left( {{\log\left( r_{X/Y} \right)} - {\log\;\left( r_{X/Y}^{rev} \right)}} \right)} = {{\log\;\left( {{a_{1}(k)}/{a_{2}(k)}} \right)} + \left( {{{\log\left( {{E_{X}(k)}/{E_{Y}(k)}} \right)} - {\log\left( {{E_{X}(k)}/{E_{Y}(k)}} \right)}} = {\log\;\left( {{a_{1}(k)}/{a_{2}(k)}} \right)}} \right.}} & (4)\end{matrix}$which is the desired log abundance ratio. Cancellation of the bias termslog(E_(X)(k)/E_(Y)(k)) and log(E_(X)(k)/E_(Y)(k)) relies on constancy ofthe biases between the first and second hybridization experiments ineach fluorophore-reversed pair. Equation (4) can be written equivalentlyusing ratios as found in equations (1)-(3) instead of differences of logratios. However, changes in constituent levels are most appropriatelyexpressed as the logarithm of the ratio of abundance in the pair ofconditions forming the differential measurement. This is because foldchanges are more meaningful than changes in absolute level,biologically.

This method of bias removal is particularly useful in two-colorhybridization experiments. FIG. 4 illustrates the bias removal method ofthe present invention. FIG. 4 a is a color ratio vs. intensity plot fora two-color hybridization experiment in which the two cultures used arenominally the same background strain of the yeast S. cerevisiae. Becausethe two cultures are nominally the same, it is expected that individualspots on the microarray would fluoresce with the same amount ofintensity for both of the fluorophores used. Experimental methods aredescribed in the experimental section infra. However, as is readilyapparent from FIG. 4 a, some of the spots on the microarray exhibitfluorophore-specific intensity. For example, spots on the microarray,corresponding to various genes in the yeast S. cerevisiae, in which theintensity of the ‘red’ fluorophore is a factor of two or more greaterthan the corresponding ‘green’ intensity are flagged because of theirstrong fluorophore-specific bias. FIG. 4 b shows the result of thefluorophore-reversed version of the experiment plotted in FIG. 4 a. Theflagged genes in FIG. 4 b now have opposite bias. FIG. 4 c shows theresult of combining the data of FIGS. 4 a and 4 b according to themethods of the present invention described above. The biases of theflagged genes have been greatly reduced.

The procedure for bias removal as described above may be applied inother contexts. For example, if cultures must be grown at certainpositions in an incubator, and harvested in a certain order, thepositions and order for two culture types may be reversed in asubsequent experiment and the results combined as described to reducesubtle biases due to temperature or latency differences.

5.3 Combination of Multiple Experiments using Rank-Based Methods

The prior art does not provide a clear method for optimally combiningthe results of multiple microarray experiments. The results of severalexperiments could be averaged. However, averaging does not provideinformation on the statistical significance of any given measurement foreach specific gene of interest in the microarray experiments. Thissection develops a sophisticated method for determining whether thestatistical significance of the up- or down-regulation measured forparticular genes of interest in multiple microarray experiments. Thesemethods could be applied to nominal repeats of a two-fluorophore DNAmicorarray experiment. Alternatively, these methods could be applied toone or more repeats of pairs of experiments, in which the firstexperiment in the pair represents a baseline state and the second memberof the paired repeats represents a biological state after a perturbationhas been applied.

If a gene of interest is present in the top 5% of up regulations in afirst and second nominal repeat of a microarray experiment, the chancethat it appeared that up regulated by chance in both arrays is only0.05*0.05=0.0025 or 0.25%, assuming systematic biases have been removed.Thus repeating the measurement allows a much higher level of confidencein declaring that the gene of interest is up regulated. In general, ifexpression ratios in any number of repeated experiments are expressed aspercentile rankings, the chance P(H₀) that any (pre-specified) gene ofinterest is not actually up regulated is

$\begin{matrix}{{P\left( H_{0}^{+} \right)} = {\prod\limits_{i}P_{i}}} & (5)\end{matrix}$where P, is the percentile rank in the i^(th) experiment, expressed as afraction (fifth percentile=0.05). The probability that the gene is notdown-regulated is given by

$\begin{matrix}{{P\left( H_{0}^{-} \right)} = {\prod\limits_{i}\left( {1 - P_{i}} \right)}} & (6)\end{matrix}$These rank-based methods provide a powerful way of reducing false alarmswith repeated measurements. For example, setting a threshold at theupper 5% of expression ratios in a hybridization to probes covering theyeast genome, which has approximately 6000 genes, would yield˜6000*0.05=300 false detections in a single experiment, but less thanone false detection on average if the same 5% threshold were appliedacross four experiment repeats (6000*(0.05)⁴). This rank combining hasthe advantage that it does not require any modeling of the detailederror behavior in the underlying hybridization experiments, other thanthe assumption of no systematic biases. The rank based method is anexample of a non-parametric statistical test for the significance ofobserved up- or down-regulations.

Percentile rankings such as equations (5) and (6) are based upon theassumption that the underlying error behavior is similar for all genes.This is not necessarily the case. For example, in FIG. 5, which plotsthe expression ratio of two nominative repeats of the same experiment,the weakly expressing genes, as expressed by log₁₀(intensity), have alog₁₀(expression ratio) that deviates from the ideal value of zero.Further, as exhibited by FIG. 5, the weaker expressing a particular geneis, the higher the tendency of the log₁₀(expression ratio) of the genefrom two nominal repeats of an experiment to deviate from zero. Thus,the low-abundance (weakly expressing and hence low-intensityhybridization) genes will tend to occupy the tails of the distributionof expression ratios (i.e. deviate from zero in accordance with FIG. 5)more often than the higher-abundance genes.

To account for the intensity-dependent error exhibited in hybridizationexperiments such as the one illustrated by FIG. 5, a measure of up- anddown-regulation that makes the error level independent of intensity canbe devised. This intensity-independent error level is derived by takingadvantage of a statistic that is capable of characterizing the errorenvelope exhibited in hybridization experiments. This error envelope isillustrated in FIG. 5 by contour lines. The many sources of error thatunderlie the experiments used to generate plots such as shown in FIG. 5generally fall into two categories—additive and multiplicative.Therefore the following statistical representation

$\begin{matrix}{d = \frac{\left( {X - Y} \right)}{\sqrt{\sigma_{X}^{2} + \sigma_{Y}^{2} + {f^{2}\left( {X^{2} + Y^{2}} \right)}}}} & (7)\end{matrix}$where X and Y are the brightness for a probe spot on the microarray withrespect to the X and Y fluorophores), σ_(X) ² is a variance term for Xand represents the additive error level in the X channel, σ_(Y) ² is avariance term for Y and represents the additive error level in the Ychannel, and f is the fractional multiplicative error level, provides aparticularly well suited model for fitting the resultant error.Alternatively, X and Y are the brightness of a probe spot correspondingto a cellular constituent of interest derived from a pair ofsingle-fluorophore experiments. In one such embodiment, the firstfluorophore (X) may optionally represent a biological system in a baseline state whereas the second fluorophore (Y) may represent thebiological system in a perturbed state. Regardless of whether a singlefluorophore or a dual-fluorophore embodiment is chosen, the fractionalmultiplicative error, f, is empirically derived by fitting thedenominator of equation 7 to the measured data. The denominator ofEquation (7) is the expected standard error of the numerator, so d hasunit variance. d is therefore an error distribution statistic that isindependent of intensity, and therefore applicable to rank methods. Anyother definition with the non-parametric properties of equation (7) isalso a good variable to use in the rank methods.

According to the methods of the present invention, the denominator ofequation (7) is used to generate the intensity independent contour linesshown in FIG. 5. Thus, for example, in FIG. 5, the contour lines griddedat +1 standard deviation have been chosen. Therefore, each contour lineabove or below zero on the vertical axis (log(Expression Ratio) δ=0)represents an incremental standard deviation of error in accordance withthe denominator of equation (7). The choice of using grid lines of±standard deviation according to the denominator of equation (7) iscompletely arbitrary. The contour lines could be gridded at anyconvenient value such as 0.25σ, 0.5σ, 2σ as long as the contour linesare plotted in accordance with the denominator of equation (7) or asimilar nonparamatric representation of error.

From FIG. 5 it is evident that the contour lines follow the errorenvelope. The value of d is proportional to the number of contours thata particular measurement falls away from log(Expression Ratio)=0. Thusthe errors are distributed with respect to the contours similarly at lowand at high intensity, and d has the desired property. One advantage ofplotting contour lines is that the amount of error associated with eachcellular constituent measured on the microarray can be calculated basedon information derived from the variance of all the cellularconstituents on the microarray across a plurality of measurements. Thus,by using grid lines as plotted in FIG. 5, the significance of anydeviation between X_(i) and Y_(i), in a two-color fluorescent probehybridization experiment, where i is a particular cellular constituent,will be placed in the context of the entire error envelope using anequation such as the denominator of equation 7. This provides anintensity independent method for determining the reliability ofmeasurements made of particular cellular constituents in microarrayexperiments including two-fluorophore or single-fluorophore experiments.

In addition to depending on intensity, error levels also may begene-specific, again violating the assumption underlying Equations (5)and (6). In this case we may define for any gene, in analogy to Equation(7),

$\begin{matrix}{d = \frac{\left( {X - Y} \right)}{\sigma_{X - Y}}} & (7.1)\end{matrix}$where σ_(X-Y) is the standard error (rms uncertainty) associated withthat gene. This uncertainty may be derived from repeated controlexperiments where X and Y are derived from the same biological system,in which case σ_(X-Y) is the observed standard deviation of X-Y for thatgene over the set of experiments. This definition of d then is similarlydistributed for all genes, and (5) and (6) may be used with ranking d.5.4 Combination of Multiple Experiments Using Weighted Average Protocols

Repeated measurements may be combined to yield a quantitative expressionlevel or expression ratio with smaller error bars than individualmeasurements. Weighting the averaging procedure according to theindividual experimental error levels requires knowing or assumingsomething about the error behavior for each measured quantity. Ingeneral, an unbiased weighted mean with minimum variance is achieved bythe formula

$\begin{matrix}{x = \frac{\sum\left( {x_{i}/\sigma_{i}^{2}} \right)}{\sum\left( {1/\sigma_{i}^{2}} \right)}} & (8)\end{matrix}$where x is the weighted mean of the cellular constituent being measured,x_(i), and each σ_(i) ² is the variance of an individual x_(i). See, forexample, equation 5-6 in “Data Reduction and Error Analysis for thePhysical Sciences”, 1969, Bevington, McGraw-Hill, New York, which isincorporated by reference herein in its entirety.

Each σ_(i) ² in equation (8) may be determined in a variety of ways. Oneapproach is to calculate the error envelope for a microarray experimentusing two nominal repeats of the two-fluorophore microarray experimentin which the only difference between the two experiments is that the twofluorophores utilized are reversed. See e.g. FIG. 4. Alternatively, onlyone fluorophore could be utilized. Therefore, there could be nodifference at all in the two nominal repeats that are paired in order todetermine an error envelope. Such a paired experiment is illustrated inFIG. 5. FIG. 5 also illustrates intensity independent contour lines thatare fitted in accordance with the denominator of equation (7). Todetermine individual σ_(i) ², for each individual measurement i, theintensity (x_(i)) is plotted on the appropriate reference plot, such asFIG. 5. For example, in FIG. 5, the intensity of individual measurementswould be plotted along the horizontal axis. Once the horizontal positionis determined, σ_(i) ² is calculated based upon the width of the ±1σintensity independent contour lines at position x_(i) on the referenceplot.

A general formula for the uncertainty of the mean is

$\begin{matrix}{\sigma_{x}^{2} = \frac{1}{\sum\left( \frac{1}{\sigma_{i}^{2}} \right)}} & (9)\end{matrix}$in accordance with formula 5-10 of “Data Reduction and Error Analysisfor the Physical Sciences”, supra. Note that when the errors associatedwith the different nominally repeated measurements are equal, the errorin the mean is N^(−1/2) times the individual errors.

In practice the individual errors, σ_(i) ², are themselves uncertain.Inspection of control experiments such as FIG. 5 indicates the roughdistribution of errors, but do not indicate whether individual genes ata particular intensity tend to have larger errors due to peculiaritiesof their RNA extraction or even biological function in the cell. Thus abetter estimate of the error in the weighted mean is obtained by addinga component to equation (9) that accounts for scatter in the repeatedmeasurements. If we denote the observed standard deviation for gene j ass_(j), the error in the mean may be described as:

$\begin{matrix}{\sigma_{x} = {\frac{1}{N}\left\lbrack {\left( \sqrt{\frac{1}{\sum\limits_{i}\frac{1}{\sigma_{i}^{2}}}} \right) + {\left( {N - 1} \right)*s_{j}}} \right\rbrack}} & (10)\end{matrix}$where N is the number of repeated measurements. Equation (10)transitions from Equation (9) to the value of the observed scatter,s_(j), as the number of repeats, N, becomes large. Note that s_(j) iscalculated according to traditional statistical methods, such that

$\begin{matrix}{s_{j} \cong {\frac{1}{N - 1}{\sum\limits_{i}\left( {x_{i} - \overset{\_}{x}} \right)^{2}}}} & (11)\end{matrix}$where N is the number of measurements, x_(i) are individual measurementsof the intensity of gene j in a particular microarray experiment and Xis the sample mean of the individual measurements. See e.g. equation2-10 in “Data Reduction and Error Analysis for the Physical Sciences”,supra, where s_(j)=σ². An estimate of the error of the mean, x, asdescribed by equation (10) is necessary because, equations such as (11)require a large number of nominal repeats (N) in order to be a truereflection of error. Estimates of error based on equation (9) do nottake into consideration the errors that particular measurement aresusceptible to as illustrated in FIG. 1 and as well as gene specificanomalies. One skilled in the art will note that other equations thataccomplish the transition from equation (9) to equation (10) arepossible.

FIG. 6 illustrates the reduction in error obtained with repeatedexperiments, and the consequent gain in information. FIG. 6 a is thesignature plot for a single experiment with the drug CsA, obtained asdescribed in the experimental section infra. One sigma error bars havebeen assigned based on the denominator of equation (7), with values forthe additive and multiplicative error levels taken from controlexperiments. Genes are flagged with their 1-sigma error bars only ifthey are more than 1.5 sigma from the line log(ratio)=0, i.e. only ifthey are up- or down-regulated with confidence greater than 95%. FIG. 6b show the results of forming a weighted mean of four repeats (N=4).Here the same criterion of 1.5 sigma has been applied for flagging errorbars, but many more genes are flagged. Comparison with FIG. 6 aindicates that the number of detections at the 95% confidence hasincreased from 4 to more than 200 genes. Thus, the example illustratesthe additional information about drug response that can be obtained withrepeated measurements provided that measurement error is appropriatelymodeled using equations such as (10).

5.5 Response Profiles

The responses of a biological system to a perturbation, such apharmacological agent, can be measured by observing the changes in thebiological state of the biological system. A response profile is acollection of changes of cellular constituents. The response profile ofa biological system (e.g., a cell or cell culture) to the perturbation mmay be defined as the vector v^((m)):v ^((m)) =[v ₁ ^((m)) , . . . v _(i) ^((m)) , . . . v _(k) ^((m))]  (12)where v_(i) ^(m) is the amplitude of response of cellular constituent iunder the perturbation m. In some embodiments of response profiles,biological response to the application of a pharmacological agent ismeasured by the induced change in the transcript level of at least 2genes, preferably more than 10 genes, more preferably more than 100genes and most preferably more than 1,000 genes.

In some embodiments, biological response profiles comprise simply thedifference between biological variables before and after perturbation.In some preferred embodiments, the biological response is defined as theratio of cellular constituents before and after a perturbation isapplied.

In some preferred embodiments, v_(i) ^(m) is set to zero if the responseof gene i is below some threshold amplitude or confidence leveldetermined from knowledge of the measurement error behavior. In suchembodiments, those cellular constituents whose measured responses arelower than the threshold are given the response value of zero, whereasthose cellular constituents whose measured responses are greater thanthe threshold retain their measured response values. This truncation ofthe response vector is suitable when most of the smaller responses areexpected to be greatly dominated by measurement error. After thetruncation, the response vector v^((m)) also approximates a ‘matcheddetector’ (see, e.g., Van Trees, 1968, Detection, Estimation, andModulation Theory Vol. I, Wiley & Sons) for the existence of similarperturbations. It is apparent to those skilled in the art that thetruncation levels can be set based upon the purpose of detection and themeasurement errors. For example, in some embodiments, genes whosetranscript level changes are lower than two fold or more preferably fourfold are given the value of zero.

In some preferred embodiments of response profiles, perturbations areapplied at several levels of strength. For example, different amounts ofa drug may be applied to a biological system to observe its response. Insuch embodiments, the perturbation responses may be interpolated byapproximating each by a single parameterized “model” function of theperturbation strength u. An exemplary model function appropriate forapproximating transcriptional state data is the Hill function, which hasadjustable parameters a, u₀, and n.

$\begin{matrix}{{H(u)} = \frac{{a\left( {u/u_{0}} \right)}^{n}}{1 + \left( {u/u_{0}} \right)^{n}}} & (13)\end{matrix}$The adjustable parameters are selected independently for each cellularconstituent of the perturbation response. Preferably, the adjustableparameters are selected for each cellular constituent so that the sum ofthe squares of the differences between the model function (e.g., theHill function, Equation 13) and the corresponding experimental data ateach perturbation strength is minimized. This preferable parameteradjustment method is known in the art as a least squares fit. Otherpossible model functions are based on polynomial fitting. More detaileddescription of model fitting and biological response has been disclosedin Friend and Stoughton, Methods of Determining Protein Activity LevelsUsing Gene Expression Profiles, U.S. Provisional Application Ser. No.60/084,742, filed on May 8, 1998, which is incorporated herein byreference in it's entirety for all purposes.5.6 Projected Profiles

The methods of the invention are useful for comparing augmented profilesthat contain any number of response profile and/or projected profiles.Projected profiles are best understood after a discussion of genesets,which are co-regulated genes. Projected profiles are useful foranalyzing many types of cellular constituents including genesets.

5.6.1 Co-Regulated Genes and Genesets

The use of genesets for representing projected profiles is described inthis and the following subsections and is also detailed in U.S. patentapplication Ser. No. 09/179,569 filed Oct. 27, 1998, now U.S. Pat. No.6,203,987 dated Mar. 20, 2001, entitled “Methods for using co-regulatedgenesets to enhance determination and classification of gene expression”by Friend et al., and U.S. patent application Ser. No. 09/220,275 tofiled Dec. 23, 1998 by Friend et al., entitled “Methods for usingco-regulated genesets to enhance determination and classification ofgene expression” which are both incorporated herein by reference intheir entireties. Certain genes tend to increase or decrease theirexpression in groups. Genes tend to increase or decrease their rates oftranscription together when they possess similar regulatory sequencepatterns, i.e., transcription factor binding sites. This is themechanism for coordinated response to particular signaling inputs (see,e.g., Madhani and Fink, 1998, The riddle of MAP kinase signalingspecificity, Transactions in Genetics 14:151-155; Arnone and Davidson,1997, The hardwiring of development: organization and function ofgenomic regulatory systems, Development 124:1851-1864). Separate geneswhich make different components of a necessary protein or cellularstructure will tend to co-vary. Duplicated genes (see, e.g., Wagner,1996, Genetic redundancy caused by gene duplications and its evolutionin networks of transcriptional regulators, Biol. Cybern. 74:557-567)will also tend to co-vary to the extent mutations have not led tofunctional divergence in the regulatory regions. Further, becauseregulatory sequences are modular (see, e.g., Yuh et al., 1998, Genomiccis-regulatory logic: experimental and computational analysis of a seaurchin gene, Science 279:1896-1902), the more modules two genes have incommon, the greater the variety of conditions under which they areexpected to co-vary their transcriptional rates. Separation betweenmodules also is an important determinant since co-activators also areinvolved. In summary therefore, for any finite set of conditions, it isexpected that genes will not all vary independently, and that there aresimplifying subsets of genes and proteins that will co-vary. Theseco-varying sets of genes form a complete basis in the mathematical sensewith which to describe all the profile changes within that finite set ofconditions.

5.6.2 Geneset Classification by Cluster Analysis

For many applications, it is desirable to find basis genesets that areco-regulated over a wide variety of conditions. A preferred embodimentfor identifying such basis genesets involves clustering algorithms (forreviews of clustering algorithms, see, e.g., Fukunaga, 1990, StatisticalPattern Recognition, 2nd Ed., Academic Press, San Diego; Everitt, 1974,Cluster Analysis, London: Heinemann Educ. Books; Hartigan, 1975,Clustering Algorithms, New York: Wiley; Sneath and Sokal, 1973,Numerical Taxonomy, Freeman; Anderberg, 1973, Cluster Analysis forApplications, Academic Press: New York).

In some embodiments employing cluster analysis, the expression of alarge number of genes is monitored as biological systems are subjectedto a wide variety of perturbations. A table of data containing the geneexpression measurements is used for cluster analysis. In order to obtainbasis genesets that contain genes which co-vary over a wide variety ofconditions multiple perturbations or conditions are employed. Clusteranalysis operates on a table of data which has the dimension m×k whereinm is the total number of conditions or perturbations and k is the numberof genes measured.

A number of clustering algorithms are useful for clustering analysis.Clustering algorithms use dissimilarities or distances between objectswhen forming clusters. In some embodiments, the distance used isEuclidean distance in multidimensional space:

$\begin{matrix}{{I\left( {x,y} \right)} = \left\{ {\sum\limits_{i}\left( {X_{i} - Y_{i}} \right)^{2}} \right\}^{1/2}} & (14)\end{matrix}$where I(x,y) is the distance between gene X and gene Y; X_(i) and Y_(i)are gene expression response under perturbation i. The Euclideandistance may be squared to place progressively greater weight on objectsthat are further apart. Alternatively, the distance measure may be theManhattan distance e.g., between gene X and Y, which is provided by:

$\begin{matrix}{{I\left( {x,y} \right)} = {\sum\limits_{i}{{X_{i} - Y_{i}}}^{2}}} & (15)\end{matrix}$Again, X_(i) and Y_(i) are gene expression responses under perturbationi. Some other definitions of distances are Chebychev distance, powerdistance, and percent disagreement. Percent disagreement, defined asI(x,y)=(number of X_(i)≈Y_(j))/i, is particularly useful for the methodof this invention, if the data for the dimensions are categorical innature. Another useful distance definition, which is particularly usefulin the context of cellular response, is I=1−r, where r is thecorrelation coefficient between the response vectors X, Y, also calledthe normalized dot product X*Y/|X∥Y|.

Various cluster linkage rules are useful for defining genesets. Singlelinkage, a nearest neighbor method, determines the distance between thetwo closest objects. By contrast, complete linkage methods determinedistance by the greatest distance between any two objects in thedifferent clusters. This method is particularly useful in cases whengenes or other cellular constituents form naturally distinct “clumps.”Alternatively, the unweighted pair-group average defines distance as theaverage distance between all pairs of objects in two different clusters.This method is also very useful for clustering genes or other cellularconstituents to form naturally distinct “clumps.” Finally, the weightedpair-group average method may also be used. This method is the same asthe unweighted pair-group average method except that the size of therespective clusters is used as a weight. This method is particularlyuseful for embodiments where the cluster size is suspected to be greatlyvaried (Sneath and Sokal, 1973, Numerical taxonomy, San Francisco: W. H.Freeman & Co.). Other cluster linkage rules, such as the unweighted andweighted pair-group centroid and Ward's method are also useful for someembodiments of the invention. See., e.g., Ward, 1963, J. Am. Stat Assn.58: 236; Hartigan, 1975, Clustering algorithms, New York: Wiley.

As the diversity of perturbations in the clustering set becomes verylarge, the genesets which are clearly distinguishable get smaller andmore numerous. However, even ver very large experiment sets, there aresmall genesets that retain their coherence. These genesets are termedirreducible genesets. Typically, a large number of diverse perturbationsare applied to obtain such irreducible genesets.

Often, the clustering of genesets is represented graphically and istermed a ‘tree’. Genesets may be defined based on the many smallerbranches of a tree, or a small number of larger branches by cuttingacross the tree at different levels. The choice of cut level may be madeto match the number of distinct response pathways expected. If little orno prior information is available about the number of pathways, then thetree should be divided into as many branches as are truly distinct.‘Truly distinct’ may be defined by a minimum distance value between theindividual branches. Typical values are in the range 0.2 to 0.4 where 0is perfect correlation and I is zero correlation, but may be larger forpoorer quality data or fewer experiments in the training set, or smallerin the case of better data and more experiments in the training set.

Preferably, ‘truly distinct’ may be defined with an objective test ofstatistical significance for each bifurcation in the tree. In one aspectof the invention, the Monte Carlo randomization of the experiment indexfor each cellular constituent's responses across the set of experimentsis used to define an objective test.

In some embodiments, the objective test is defined in the followingmanner:

Let p_(ki) be the response of constituent k in experiment i. Let II(i)be a random permutation of the experiment index. Then for each of alarge (about 100 to 1000) number of different random permutations,construct p_(kII(i)). For each branching in the original tree, for eachpermutation:

-   -   (1) perform hierarchical clustering with the same algorithm        (‘hclust’ in this case) used on the original unpermuted data;    -   (2) compute fractional improvement f in the total scatter with        respect to cluster centers in going from one cluster to two        clusters        f=1−ΣD _(k) ⁽¹⁾ /ΣD _(k) ⁽²⁾  (16)        where D_(k) is the square of the distance measure for        constituent k with respect to the center (mean) of its assigned        cluster. Superscript 1 or 2 indicates whether it is with respect        to the center of the entire branch or with respect to the center        of the appropriate cluster out of the two subclusters. There is        considerable freedom in the definition of the distance function        D used in the clustering procedure. In these examples, D=1−r,        where r is the correlation coefficient between the responses of        one constituent across the experiment set vs. the responses of        the other (or vs. the mean cluster response).

The distribution of fractional improvements obtained from the MonteCarlo procedure is an estimate of the distribution under the nullhypothesis that a given branching was not significant. The actualfractional improvement for that branching with the unpermuted data isthen compared to the cumulative probability distribution from the nullhypothesis to assign significance. Standard deviations are derived byfitting a log normal model for the null hypothesis distribution. Usingthis procedure, a standard deviation greater than about 2, for example,indicates that the branching is significant at the 95% confidence level.Genesets defined by cluster analysis typically have underlyingbiological significance.

Another aspect of the cluster analysis method provides the definition ofbasis vectors for use in profile projection described in the followingsections.

A set of basis vectors V has k×n dimensions, where k is the number ofgenes and n is the number of genesets.

$\begin{matrix}{V = \begin{bmatrix}V_{1}^{(1)} & \cdot & V_{1}^{(n)} \\ \cdot & \cdot & \cdot \\V_{k}^{(1)} & \cdot & V_{k}^{(n)}\end{bmatrix}} & (17)\end{matrix}$V^((n)) _(k) is the amplitude contribution of gene index k in basisvector n. In some embodiments, V^((n)) _(k)=1, if gene k is a member ofgeneset n, and V^((n)) _(k)=0 if gene k is not a member of geneset n. Insome embodiments, V^((n)) _(k) is proportional to the response of gene kin geneset n over the training data set used to define the genesets.

In some preferred embodiments, the elements V^((n)) _(k) are normalizedso that each V^((n)) _(k) has unit length by dividing by the square rootof the number of genes in geneset n. This produces basis vectors whichare not only orthogonal (the genesets derived from cutting theclustering tree are disjoint), but also orthonormal (unit length). Withthis choice of normalization, random measurement errors in profilesproject onto the V^((n)) _(k) in such a way that the amplitudes tend tobe comparable for each n. Normalization prevents large genesets fromdominating the results of similarity calculations.

5.6.3 Geneset Classification Based upon Mechanisms of Regulation

Genesets can also be defined based upon the mechanism of the regulationof genes. Genes whose regulatory regions have the same transcriptionfactor binding sites are more likely to be co-regulated. In somepreferred embodiments, the regulatory regions of the genes of interestare compared using multiple alignment analysis to decipher possibleshared transcription factor binding sites (Stormo and Hartzell, 1989,Identifying protein binding sites from unaligned DNA fragments, ProcNatl Acad Sci 86: 1183-1187; Hertz and Stormo, 1995, Identification ofconsensus patterns in unaligned DNA and protein sequences: alarge-deviation statistical basis for penalizing gaps, Proc of 3rd IntlConf on Bioinformatics and Genome Research, Lim and Cantor, eds., WorldScientific Publishing Co., Ltd. Singapore, pp. 201-216). For example, asExample 3, infra, shows, common promoter sequence responsive to Gcn4 in20 genes may be responsible for those 20 genes being co-regulated over awide variety of perturbations.

The co-regulation of genes is not limited to those with binding sitesfor the same transcriptional factor. Co-regulated (co-varying) genes maybe in the up-stream/down-stream relationship where the products ofup-stream genes regulate the activity of down-stream genes. It is wellknown to those of skill in the art that there are numerous varieties ofgene regulation networks. One of skill in the art also understands thatthe methods of this invention are not limited to any particular kind ofgene regulation mechanism. If it can be derived from the mechanism ofregulation that two genes are co-regulated in terms of their activitychange in response to perturbation, the two genes may be clustered intoa geneset.

Because of lack of complete understanding of the regulation of genes ofinterest, it is often preferred to combine cluster analysis withregulatory mechanism knowledge to derive better defined genesets. Insome embodiments, K-means clustering may be used to cluster genesetswhen the regulation of genes of interest is partially known. K-meansclustering is particularly useful in cases where the number of genesetsis predetermined by the understanding of the regulatory mechanism. Ingeneral, K-mean clustering is constrained to produce exactly the numberof clusters desired. Therefore, if promoter sequence comparisonindicates the measured genes should fall into three genesets, K-meansclustering may be used to generate exactly three genesets with greatestpossible distinction between clusters.

5.6.4 Representing Projected Profiles

The expression value of genes can be converted into the expression valuefor genesets. This process is referred to as projection. In someembodiments, the projection is as follows:P=[P ₁ , . . . P _(i) , . . . P _(n) ]=p*V  (18)wherein, p is the expression profile, P is the projected profile, P_(i)is expression value for geneset i and V is a predefined set of basisvectors. The basis vectors have been previously defined in Equation 17as:

$\begin{matrix}{V = \begin{bmatrix}V_{1}^{(1)} & \cdot & V_{1}^{(n)} \\ \cdot & \cdot & \cdot \\V_{k}^{(1)} & \cdot & V_{k}^{(n)}\end{bmatrix}} & (19)\end{matrix}$wherein v^((n)) _(k) is the amplitude of cellular constituent index k ofbasis vector n.

In one preferred embodiment, the value of geneset expression is simplythe average of the expression value of the genes within the geneset. Insome other embodiments, the average is weighted so that highly expressedgenes do not dominate the geneset value. The collection of theexpression values of the genesets is the projected profile.

5.6.5 Profile Comparison and Classification

Once the basis genesets are chosen, projected profiles P_(i) may beobtained for any set of profiles indexed by i. Similarities between theP_(i) may be more clearly seen than between the original profiles p_(i)for two reasons. First, measurement errors in extraneous genes have beenexcluded or averaged out. Second, the basis genesets tend to capture thebiology of the profiles p_(i) and so are matched detectors for theirindividual response components. Classification and clustering of theprofiles both are based on an objective similarity metric, call it S,where one useful definition isS _(ij) =S(P _(i) ,P _(j))=P _(i) *P _(j)/(|P _(i) ∥P _(j)|)  (20)

This definition is the generalized angle cosine between the vectorsP_(i) and P_(j). It is the projected version of the conventionalcorrelation coefficient between p_(i) and p_(j). Profile p_(i) is deemedmost similar to that other profile p_(j) for which S_(j) is maximum. Newprofiles may be classified according to their similarity to profiles ofknown biological significance, such as the response patterns for knowndrugs or perturbations in specific biological pathways. Sets of newprofiles may be clustered using the distarice metricD _(ij)=1−S _(ij)  (21)where this clustering is analogous to clustering in the original largerspace of the entire set of response measurements, but has the advantagesjust mentioned of reduced measurement error effects and enhanced captureof the relevant biology.

The statistical significance of any observed similarity S_(ij) may beassessed using an empirical probability distribution generated under thenull hypothesis of no correlation. This distribution is generated byperforming the projection, Equations (19) and (20) for many differentrandom permutations of the constituent index in the original profile p.That is, the ordered set p_(k) are replaced by p_(II(k)) where II(k) isa permutation, for ˜100 to 1000 different random permutations. Theprobability of the similarity S_(ij) arising by chance is then thefraction of these permutations for which the similarity S_(ij)(permuted) exceeds the similarity observed using the original unpermuteddata.

5.7 Methods for Determining Biological Response Profiles

This section provides some exemplary methods for measuring biologicalresponses as well as the procedures necessary to make the reagents usedin such methods.

5.7.1 Preparation of Microarrays

Microarrays are known in the art and consist of a surface to whichprobes that correspond in sequence to gene products (e.g., cDNAs, mRNAs,cRNAs, polypeptides, and fragments thereof), can be specificallyhybridized or bound at a known position. In one embodiment, themicroarray is an array (i.e., a matrix) in which each positionrepresents a discrete binding site for a product encoded by a gene(e.g., a protein or RNA), and in which binding sites are present forproducts of most or almost all of the genes in the organism's genome. Ina preferred embodiment, the “binding site” (hereinafter, “site”) is anucleic acid or nucleic acid analogue to which a particular cognate cDNAcan specifically hybridize. The nucleic acid or analogue of the bindingsite can be, e.g., a synthetic oligomer, a full-length cDNA, a less-thanfull length cDNA, or a gene fragment.

Although in a preferred embodiment the microarray contains binding sitesfor products of all or almost all genes in the target organism's genome,such comprehensiveness is not necessarily required. Usually themicroarray will have binding sites corresponding to at least about 50%of the genes in the genome, often at least about 75%, more often atleast about 85%, even more often more than about 90%, and most often atleast about 99%. Preferably, the microarray has binding sites for genesrelevant to the action of a drug of interest or in a biological pathwayof interest. A “gene” is an open reading frame (ORF) of preferably atleast 50, 75, or 99 amino acids from which a messenger RNA istranscribed in the organism (e.g.; if a single cell) or in some cell ina multicellular organism. The number of genes in a genome can beestimated from the number of mRNAs expressed by the organism, or byextrapolation from a well-characterized portion of the genome. When thegenome of the organism of interest has been sequenced, the number ofORFs can be determined and mRNA coding regions identified by analysis ofthe DNA sequence. For example, the Saccharomyces cerevisiae genome hasbeen completely sequenced and is reported to have approximately 6275open reading frames (ORFs) longer than 99 amino acids. Analysis of theseORFs indicates that there are 5885 ORFs that are likely to specifyprotein products (Goffeau et al., 1996, Life with 6000 genes, Science274: 546-567, which is incorporated by reference in its entirety for allpurposes). In contrast, the human genome is estimated to containapproximately 10⁵ genes.

5.7.2 Preparing Nucleic Acids for Microarrays

As noted above, the “binding site” to which a particular cognate cDNAspecifically hybridizes is usually a nucleic acid or nucleic acidanalogue attached at that binding site. In one embodiment, the bindingsites of the microarray are DNA polynucleotides corresponding to atleast a portion of each gene in an organism's genome. These DNAs can beobtained by,

e.g., polymerase chain reaction (PCR) amplification of gene segmentsfrom genomic DNA, cDNA (eg., by RT-PCR), or cloned sequences. PCRprimers are chosen, based on the known sequence of the genes or cDNA,that result in amplification of unique fragments (i.e., fragments thatdo not share more than 10 bases of contiguous identical sequence withany other fragment on the microarray). Computer programs are useful inthe design of primers with the required specificity and optimalamplification properties. In the case of binding sites corresponding tovery long genes, it will sometimes be desirable to amplify segments nearthe 3‘end of the gene so that when oligo-dT primed cDNA probes arehybridized to the microarray, less-than-full length probes will bindefficiently. Typically each gene fragment on the microarray will bebetween about 50 bp and about 2000 bp, more typically between about 100bp and about 1000 bp, and usually between about 300 bp and about 800 bpin length. PCR methods are well known and are described, for example, inInnis et al. eds., 1990, PCR Protocols: A Guide to Methods andApplications, Academic Press Inc., San Diego, Calif., which isincorporated by reference in its entirety for all purposes. Analternative means for generating the nucleic acid for the microarray isby synthesis of synthetic polynucleotides or oligonucleotides, e.g.,using N-phosphonate or phosphoramidite chemistries (Froehler et al.,1986, Nucleic Acid Res 14: 5399-5407; McBride et al., 1983, TetrahedronLett. 24: 245-248). Synthetic sequences are between about 15 and about500 bases in length, more typically between about 20 and about 50 bases.In some embodiments, synthetic nucleic acids include non-natural bases,e.g., inosine. As noted above, nucleic acid analogues may be used asbinding sites for hybridization. An example of a suitable nucleic acidanalogue is peptide nucleic acid (see, e.g., Egholm et al., 1993, PNAhybridizes to complementary oligonucleotides obeying the Watson-Crickhydrogen-bonding rules, Nature 365: 566-568; see also U.S. Pat. No.5,539,083).

In an alternative embodiment, the binding (hybridization) sites are madefrom plasmid or phage clones of genes, cDNAs (e.g., expressed sequencetags), or inserts therefrom (Nguyen et al., 1995, Differential geneexpression in the murine thymus assayed by quantitative hybridization ofarrayed cDNA clones, Genomics 29: 207-209). In yet another embodiment,the polynucleotide of the binding sites is RNA.

5.7.3 Attaching Nucleic Acids to the Solid Surface

The nucleic acid or analogue are attached to a solid support, which maybe made from glass, plastic (e.g., polypropylene, nylon),polyacrylamide, nitrocellulose, or other materials. A preferred methodfor attaching the nucleic acids to a surface is by printing on glassplates, as is described generally by Schena et al., 1995, Quantitativemonitoring of gene expression patterns with a complementary microarray,Science 270: 467-470. This method is especially useful for preparingmicroarrays of cDNA. See also DeRisi et al., 1996, Use of a cMicroarrayto analyze gene expression patterns in human cancer, Nature Genetics 14:457-460; Shalon et al., 1996, A microarray system for analyzing complexDNA samples using two-color fluorescent probe hybridization, Genome Res.6: 639-645; and Schena et al., 1995, Parallel human genome analysis;microarray-based expression of 1000 genes, Proc. Natl. Acad. Sci. USA93: 10539-11286.

A second preferred method for making microarrays is by makinghigh-density oligonucleotide arrays. Techniques are known for producingarrays containing thousands of oligonucleotides complementary to definedsequences, at defined locations on a surface using photolithographictechniques for synthesis in situ (see, Fodor et al., 1991,Light-directed spatially addressable parallel chemical synthesis,Science 251: 767-773; Pease et al., 1994, Light-directed oligonucleotidearrays for rapid DNA sequence analysis, Proc. Natl. Acad. Sci. USA 91:5022-5026; Lockhart et al., 1996, Expression monitoring by hybridizationto high-density oligonucleotide arrays, Nature Biotech 14: 1675; U.S.Pat. Nos. 5,578,832; 5,556,752; and 5,510,270, each of which isincorporated by reference in its entirety for all purposes) or othermethods for rapid synthesis and deposition of defined oligonucleotides(Blanchard et al., 1996, High-Density Oligonucleotide arrays, Biosensors& Bioelectronics 11: 687-90). When these methods are used,oligonucleotides (e.g., 20-mers) of known sequence are synthesizeddirectly on a surface such as a derivatized glass slide. Usually, thearray produced contains multiple probes against each target transcript.Oligonucleotide probes can be chosen to detect alternatively splicedmRNAs or to serve as various type of control.

Another preferred method of making microarrays is by use of an inkjetprinting process to synthesize oligonucleotides directly on a solidphase, as described, e.g., in co-pending U.S. patent application Ser.No. 09/008,120 filed on Jan. 16, 1998, by Blanchard entitled “ChemicalSynthesis Using Solvent Microdroplets”, which is incorporated byreference herein in its entirety.

Other methods for making microarrays, e.g., by masking (Maskos andSouthern, 1992, Nuc. Acids Res. 20: 1679-1684), may also be used. Inprincipal, any type of array, for example, dot blots on a nylonhybridization membrane (see Sambrook et al., Molecular Cloning—ALaboratory Manual (2nd Ed.), Vol. 1-3, Cold Spring Harbor Laboratory,Cold Spring Harbor, N.Y., 1989), could be used, although, as will berecognized by those of skill in the art, very small arrays will bepreferred because hybridization volumes will be smaller.

5.7.4 Generating Labeled Probes

Methods for preparing total and poly(A)+ RNA are well known and aredescribed generally in Sambrook et al., supra. In one embodiment, RNA isextracted from cells of the various types of interest in this inventionusing guanidinium thiocyanate lysis followed by CsCl centrifugation(Chirgwin et al., 1979, Biochemistry 18: 5294-5299). Poly(A)+ RNA isselected by selection with oligo-dT cellulose (see Sambrook et al.,supra). Cells of interest include wild-type cells, drug-exposedwild-type cells, modified cells, and drug-exposed modified cells.

Labeled cDNA is prepared from mRNA by oligo dT-primed or random-primedreverse transcription, both of which are well known in the art (see,e.g., Klug and Berger, 1987, Methods Enzymol. 152: 316-325). Reversetranscription may be carried out in the presence of a DNTP conjugated toa detectable label, most preferably a fluorescently labeled dNTP.Alternatively, isolated mRNA can be converted to labeled antisense RNAsynthesized by in vitro transcription of double-stranded cDNA in thepresence of labeled dNTPs (Lockhart et al., 1996, Expression monitoringby hybridization to high-density oligonucleotide arrays, Nature Biotech.14: 1675, which is incorporated by reference in its entirety for allpurposes). In alternative embodiments, the cDNA or RNA probe can besynthesized in the absence of detectable label and may be labeledsubsequently, e.g., by incorporating biotinylated dNTPs or rNTP, or somesimilar means (e.g., photo-cross-linking a psoralen derivative of biotinto RNAs), followed by addition of labeled streptavidin (e.g.,phycoerythrin-conjugated streptavidin) or the equivalent.

When fluorescently-labeled probes are used, many suitable fluorophoresare known, including fluorescein, lissamine, phycoerythrin, rhodamine(Perkin Elmer Cetus), Cy2, Cy3, Cy3.5, Cy5, Cy5.5, Cy7, Fluor X(Amersham) and others (see, e.g., Kricka, 1992, Nonisotopic DNA ProbeTechniques, Academic Press San Diego, Calif.). It will be appreciatedthat pairs of fluorophores are chosen that have distinct emissionspectra so that they can be easily distinguished.

In another embodiment, a label other than a fluorescent tabel is used.For example, a radioactive label, or a pair of radioactive labels withdistinct emission spectra, can be used (see Zhao et al., 1995, Highdensity cDNA filter analysis: a novel approach for large-scale,quantitative analysis of gene expression, Gene 156: 207; Pietu et al.,1996, Novel gene transcripts preferentially expressed in human musclesrevealed by quantitative hybridization of a high density cDNA array,Genome Res. 6: 492). However, because of scattering of radioactiveparticles, and the consequent requirement for widely spaced bindingsites, use of radioisotopes is a less-preferred embodiment.

In one embodiment, labeled cDNA is synthesized by incubating a mixturecontaining 0.5 mM dGTP, DATP and dCTP plus 0.1 mM dTTP plus fluorescentdeoxyribonucleotides (e.g., 0.1 mM Rhodamine 110 UTP (Perken ElmerCetus) or 0.1 mM Cy3 dUTP (Amersham)) with reverse transcriptase (e.g.,SuperScript™II, LTI Inc.) at 42° C. for 60 minutes.

5.7.5 Hybridization to Microarrays

Nucleic acid hybridization and wash conditions are optimally chosen sothat the probe “specifically binds” or “specifically hybridizes” to aspecific array site, i.e., the probe hybridizes, duplexes or binds to asequence array site with a complementary nucleic acid sequence but doesnot hybridize to a site with a non-complementary nucleic acid sequence.One polynucleotide sequence is considered complementary to another when,if the shorter of the polynucleotides is less than or equal to 25 bases,there are no mismatches using standard base-pairing rules or, if theshorter of the potynucleotides is longer than 25 bases, there is no morethan a 5% mismatch. Preferably, the polynucleotides are perfectlycomplementary (no mismatches). It can easily be demonstrated thatspecific hybridization conditions result in specific hybridization bycarrying out a hybridization assay including negative controls (see,e.g., Shalon et al., supra, and Chee et al., supra).

Optimal hybridization conditions will depend on the length (e.g.,oligomer versus polynucleotide greater than 200 bases) and type (e.g.,RNA, DNA, PNA) of labeled probe and immobilized polynucleotide oroligonucleotide. General parameters for specific (i.e., stringent)hybridization conditions for nucleic acids are described in Sambrook etal., supra, and in Ausubel et al., 1987, Current Protocols in MolecularBiology, Greene Publishing and Wiley-Interscience, New York. When themicroarrays of Schena et al. are used, typical hybridization conditionsare hybridization in 5×SSC plus 0.2% SDS at 65° C. for 4 hours followedby washes at 25° C. in low stringency wash buffer (1×SSC plus 0.2% SDS)followed by 10 minutes at 25° C. in high stringency wash buffer (0.1×SSCplus 0.2% SDS) (Shena et al., 1996, Proc. Natl. Acad. Sci. USA, 93:10614). Useful hybridization conditions are also provided in, e.g.,Tijessen, 1993, Hybridization With Nucleic Acid Probes, Elsevier SciencePublishers B. V. and Kricka, 1992, Nonisotopic DNA Probe Techniques,Academic Press San Diego, Calif.

5.8 Computer Implementations

The analytic methods described in the previous sections can beimplemented by use of the following computer systems and according tothe following programs and methods. FIG. 7 illustrates an exemplarycomputer system suitable for implementation of the analytic methods ofthis invention. Computer system 501 is illustrated as comprisinginternal components and being linked to external components. Theinternal components of this computer system include processor element502 interconnected with main memory 503. For example, computer system501 can be an Intel 8086-, 80386-, 80486-, Pentium®, or Pentium®-basedprocessor with preferably 32 MB or more of main memory.

The external components include mass storage 504. This mass storage canbe one or more hard disks (which are typically packaged together withthe processor and memory). Such hard disks are preferably of 1 GB orgreater storage capacity. Other external components include userinterface device 505, which can be a monitor, together with inputingdevice 506, which can be a “mouse”, or other graphic input devices (notillustrated), and/or a keyboard. A printing device 508 can also beattached to the computer 501.

Typically, computer system 501 is also linked to network link 507, whichcan be part of ah Ethernet link to other local computer systems, remotecomputer systems, or wide area communication networks, such as theInternet. This network link allows computer system 501 to share data andprocessing tasks with other computer systems.

Loaded into memory during operation of this system are several softwarecomponents, which are both standard in the art and special to theinstant invention. These software components collectively cause thecomputer system to function according to the methods of this invention.These software components are typically stored on mass storage 504.Software component 510 represents the operating system, which isresponsible for managing computer system 501 and its networkinterconnections. This operating system can be, for example, of theMicrosoft Windows' family, such as Windows 3.1, Windows 95, Windows 98,or Windows NT. Software component 511 represents common languages andfunctions conveniently present on this system to assist programsimplementing the methods specific to this invention. Many high or lowlevel computer languages can be used to program the analytic methods ofthis invention. Instructions can be interpreted during run-time orcompiled. Preferred languages include C/C++, FORTRAN and JAVA®. Mostpreferably, the methods of this invention are programmed in mathematicalsoftware packages that allow symbolic entry of equations and high-levelspecification of processing, including algorithms to be used, therebyfreeing a user of the need to procedurally program individual equationsor algorithms. Such packages include Matlab from Mathworks (Natick,Mass.), Mathematica from Wolfram Research (Champaign, IL), or S-Plusfrom Math Soft (Cambridge, Mass.). Accordingly, software component 512and/or 513 represents the analytic methods of this invention asprogrammed in a procedural language or symbolic package. In an exemplaryimplementation, to practice the methods of the present invention, a userfirst loads differential microarray experiment data into the computersystem 501. These data can be directly entered by the user from monitor505, keyboard 506, or from other computer systems linked by networkconnection 507, or on removable storage media such as a CD-ROM, floppydisk (not illustrated), tape drive (not illustrated), ZIP® drive (notillustrated) or through the network (507). Next the user causesexecution of expression profile analysis software 512 which performs themethods of the present invention.

In another exemplary implementation, a user first loads microarrayexperiment data into the computer system. This data is loaded into thememory from the storage media (504) or from a remote computer,preferably from a dynamic geneset database system, through the network(507). Next the user causes execution of software that performs thesteps of fluorophore bias removal, the rank-based methods of the presentinvention or the weighted averaging protocols of the present invention.

Alternative computer systems and software for implementing the analyticmethods of this invention will be apparent to one of skill in the artand are intended to be comprehended within the accompanying claims. Inparticular, the accompanying claims are intended to include thealternative program structures for implementing the methods of thisinvention that will be readily apparent to one of skill in the art.

6 EXPERIMENTAL

The following section details how reagents are prepared for theexperiments illustrated in FIGS. 2-6.

Construction, Growth and Drug-Treatment of Yeast Strains

The strains used in this study were constructed by standard techniques.See e.g. Schiestl et al., 1993, Introducing DNA into yeast bytransformation, Methods: A companion to Methods in Enzymology 5: 79-85.For experiments involving FK506, cells were grown for three generationsto a density of 1×10⁷ cells/ml in YAPD medium (YPD plus 0.004% adenine)supplemented with 10 mM calcium chloride as previously described byGarrett-Engele et al., 1995, Calcineurin, the Ca²⁺/calmodulin-dependentprotein phosphatase, is essential in yeast mutants with cell integritydefects and in mutants that lack functional vacuolar H(+)-ATPase, Mol.Cell. Biol. 15: 4103-4114. Where indicated, FK506 was added to a finalconcentration of 1 μg/ml 0.5 hr after inoculation of the culture.Cyclosporin A (CsA) was added to a concentration of 30 μg/ml. Cells werebroken by standard procedures (See e.g. Ausubel et al., CurrentProtocols in Molecular Biology, John Wiley & Sons, Inc. (New York),12.12.1-13.12.5) with the following modifications. Cell pellets wereresuspended in breaking buffer (0.2M Tris HCl pH 7.6, 0.5M NaCl, 10 mMEDTA, 1% SDS), vortexed for 2 minutes on a VWR multitube vortexer atsetting 8 in the presence of 60% glass beads (425-600 μm mesh; Sigma)and phenol:chloroform (50: 50, v/v). Following separation, the aqueousphase was reextracted and ethanol precipitated. Poly A⁺ RNA was isolatedby two sequential chromatographic purifications over oligo dT cellulose(NEB) using established protocols. See e.g. Ausubel et al., supra).

Preparation and Hybridization of the Labeled Sample

Fluorescently-labeled cDNA Was prepared, purified and hybridizedessentially as described by DeRisi et al. DeRisi et al., 1997, Exploringthe metabolic and genetic control of gene expression on a genomic scale,Science 278: 680-686. Briefly, Cy3- or Cy5-dUTP (Amersham) wasincorporated into cDNA during reverse transcription (Superscript II,LTI, Inc.) And purified by concentrating to less than 10 μl usingMicrocon-30 microconcentrators (Amicon). Paired cDNAs were resuspendedin 20-26 μl hybridization solution (3×SSC, 0.75 μg/ml poly A DNA, 0.2%SDS) and applied to the microarray under a 22×30 mm coverslip for 6 hrat 63° C., all according to DeRisi et al., (1997), supra.

Fabrication and Scanning of Microarrays

PCR products containing common 5′ and 3′ sequences (Research Genetics)were used as templates with amino-modified forward primer and unmodifiedreverse primers to PCR amplify 6065 ORFs from the S. cerevisiae genome.First pass success rate was 94%. Amplification reactions that gaveproducts of unexpected sizes were excluded from subsequent analysis.ORFs that could not be amplified from purchased templates were amplifiedfrom genomic DNA. DNA samples form 100 μl reactions were isopropanolprecipitated, resuspended in water, brought to 3×SSC in a total volumeof 15 and transferred to 384-well microtiter plates (Genetix). PCRproducts were spotted onto 1×3 inch polylysine-treated glass slides by arobot built according to specifications provided in Schena et al.,supra; DeRisi et al., 1996, Discovery and analysis of inflammatorydisease-related genes using micorarrays, PNAS USA, 94:2150-2155; andDeRisi et al., (1997). After printing, slides were processed followingpublished protocols. See DeRisi et al., (1997).

Microarrays were images on a prototype multi-frame CCD camera indevelopment at Applied Precision, Inc. (Seattle, Wash.). Each CCD imageframe was approximately 2 mm square. Exposure time of 2 seconds in theCy5 channel (white light through Chroma 618-648 nm excitation filter,Chroma 657-727 nm emission filter) and 1 second in the Cy3 channel(Chroma 535-560 nm excitation filter, Chroma 570-620 nm emission filter)were done consecutively in each frame before moving to the next,spatially contiguous frame. Color isolation between the Cy3 and Cy5channels was ˜100:1 or better. Frames were knitted together in softwareto make the complete images. The intensity of spots (˜100 μm) wasquantified from the 10 μm pixels by frame background subtraction andintensity averaging in each channel. Dynamic range of the resulting spotintensities was typically a ratio of 1000 between the brightest spotsand the background-subtracted additive error level. Normalizationbetween the channels was accomplished by normalizing each channel to themean intensities of all genes. This procedure is nearly equivalent tonormalization between channels using the intensity ratio of genomic DNAspots (See DeRisi et al., 1997)[11], but is possibly more robust sinceit is based on the intensities of several thousand spots distributedover the array.

Determination of Signature Correlation Coefficients and their ConfidenceLimits

Correlation coefficients between the signature ORFs of variousexperiments were calculated using

$\rho = {\sum\limits_{k}{x_{k}{y_{k}/\left( {\sum\limits_{k}{x_{k}^{2}{\sum\limits_{k}y_{k}^{2}}}} \right)^{1/2}}}}$where x_(k) is the log₁₀ of the expression ratio for the k'th gene inthe x signature, and y_(k) is the log₁₀ of the expression ratio for thek'th gene in the y signature. The summation is over those genes thatwere either up- or down-regulated in either experiment at the 95%confidence level. These genes each had a less than 5% chance of beingactually unregulated (having expression ratios departing from unity dueto measurement errors alone). This confidence level was assigned basedon an error model which assigns a lognormal probability distribution toeach gene's expression ratio with characteristic width based on theobserved scatter in its repeated measurements (repeated arrays at thesame nominal experimental conditions) and on the individual arrayhybridization quality. This latter dependence was derived from controlexperiments in which both Cy3 and Cy5 samples were derived from the sameRNA sample. For large numbers of repeated measurements the error reducesto the observed scatter. For a single measurement the error is based onthe array quality and the spot intensity.

Random measurement errors in the x and y signatures tend to bias thecorrelation toward zero. In most experiments the great majority of genesis not significantly affected but do exhibit small random measurementerrors. Selecting only the 95% confidence genes for the correlationcalculation, rather than the entire genome, reduces this bias and makesthe actual biological correlations more apparent.

Correlations between a profile and itself are unity by definition. Errorlimits on the correlation are 95% confidence limits based on theindividual measurement error bars, and assuming uncorrelated errors.They do not include the bias mentioned above; thus, a departure of pfrom unity does not necessarily mean that the underlying biologicalcorrelation is imperfect. However, a correlation of 0.7±0.1, forexample, is very significantly different from zero. Small (magnitude ofρ<0.2) but formally significant correlation in the tables and textprobably are due to small systematic biases in the Cy5/Cy3 ratios whichviolate the assumption of independent measurement errors used togenerate the 95% confidence limits. Therefore, these small correlationvalues should be treated as not significant. A likely source ofuncorrected systematic bias is the partially corrected scanner detectornonlinearity that differentially affects the Cy3 and Cy5 detectionchannels.

The 1 μg/ml FK506 treatment signature was compared to over 40 unrelateddeletion mutant or drug signatures. These control profiles hadcorrelation coefficients with the FK506 profile which were distributedaround zero (mean p=−0.03) with a standard deviation of 0.16 (data notshown) and none had correlations greater than p=0.38. Similarly, thecalcineurin mutant signature correlated well with the CsA-treatmentsignature (ρ=0.71±0.04) but not with the signatures from the negativecontrol signatures (mean Θ=−0.02 with a standard deviation of 0.18).

Quality Controls

End-to-end checks on expression ratio measurement accuracy were providedby analyzing the variance in repeated hybridizations using the same mRNAlabeled with both Cy3 and Cy5, and also using Cy3 and Cy5 mRNA samplesisolated from independent cultures of the same nominal strain andconditions. Biases undetected with this procedure, such as gene-specificbiases presumably due to differential incorporation of Cy3- and Cy5-dUTPinto cDNA, were minimized by performing hybridizations influorophore-reversed pairs, in which the Cy3/Cy5 labeling of thebiological conditions was reversed in one experiment with respect to theother. The expression ratio for each gene is then the ratio of ratiosbetween the two experiments in the pair. Other biases are removed byalgorithmic numerical detrending. The magnitude of these biases in theabsence of detrending and fluorophore reversal is typically on the orderof 30% in the ratio, but may be as high as twofold for some ORFs.

Expression ratios are based on mean intensities over each spot. Theoccasional smaller spots have fewer image pixels in the average. Thisdoes not degrade accuracy noticeably until the number of pixels fallsbelow ten, in which case the spot is rejected from the data set. Wanderof spot positions with respect to the nominal grid is adaptively trackedin array subregions by the image processing software. Unequal spotwander within a subregion greater than half a spot spacing isproblematic for the automated quantitating algorithms; in this case thespot is rejected from analysis based on human inspection of the wander.Any spots partially overlapping are excluded from the data set. Lessthan 1% of spots typically are rejected for these reasons.

7 REFERENCES CITED

All references cited herein are incorporated herein by reference intheir entirety and for all purposes to the same extent as if eachindividual publication or patent or patent application was specificallyand individually indicated to be incorporated by reference in itsentirety for all purposes.

Many modifications and variations of this invention can be made withoutdeparting from its spirit and scope, as will be apparent to thoseskilled in the art. The specific embodiments described herein areoffered by way of example only, and the invention is to be limited onlyby the terms of the appended claims, along with the full scope ofequivalents to which such claims are entitled.

What is claimed is:
 1. A computer program product embodied on a computerreadable storage medium, the computer readable storage medium encodedwith computer-executable instructions which, when executed by aprocessor, cause the processor to execute a method of reducingfluorophore-specific bias in gene expression data comprising: receivinga first plurality, second plurality, third plurality, and fourthplurality of fluorescent emission signals obtained from a two colorhybridization microarray experiment comprising steps (1)-(10): (1)labeling a first pool of genetic matter, derived from a biologicalsystem representing a baseline state, with a first fluorophore to obtaina first pool of fluorophore-labeled genetic matter; (2) labeling asecond pool of genetic matter, derived from a biological systemrepresenting a perturbed state, with a second fluorophore to obtain asecond pool of fluorophore-labeled genetic matter; (3) labeling a thirdpool of genetic matter, derived from said biological system representingsaid baseline state, with said second fluorophore to obtain a third poolof fluorophore-labeled genetic matter; (4) labeling a fourth pool ofgenetic matter, derived from said biological system representing saidperturbed state, with said first fluorophore to obtain a fourth pool offluorophore-labeled genetic matter; (5) contacting said first pool offluorophore-labeled genetic matter and said second pool offluorophore-labeled genetic matter with a first microarray underconditions such that hybridization can occur; (6) detecting said firstplurality of fluorescent emission signals at each of a plurality ofdiscrete loci on the first microarray from said first pool offluorophore-labeled genetic matter that becomes bound to said firstmicroarray under said conditions; (7) detecting said second plurality offluorescent emission signals at each of said plurality of discrete locion the first microarray from said second pool of fluorophore-labeledgenetic matter that becomes bound to said first microarray under saidconditions; (8) contacting said third pool of fluorophore-labeledgenetic matter and said fourth pool of fluorophore-labeled geneticmatter with a second microarray under conditions such that hybridizationcan occur; (9) detecting said third plurality of fluorescent emissionsignals at each of said plurality of discrete loci on the secondmicroarray from said third pool of fluorophore-labeled genetic matterthat becomes bound to said second microarray under said conditions; and(10) detecting said fourth plurality of fluorescent emission signals ateach of said plurality of discrete loci on the second microarray fromsaid fourth pool of fluorophore-labeled genetic matter that becomesbound to said second microarray under said conditions; wherein: saidfirst microarray and said second microarray are similar to each other,exact replicas of each other, or are identical; and said firstfluorophore and said second fluorophore are different; wherein thecomputer readable storage medium comprises computer executableinstructions for: normalizing a first fluorescent emission signal by afirst mean intensity, said first fluorescent emission signal beingmeasured at a first discrete locus of the plurality of discrete loci onthe first microarray, said first mean intensity being the mean intensityof the first plurality of fluorescent emission signals measured at eachof the plurality of discrete loci on the first microarray, therebycalculating a first normalized fluorescent emission signal; normalizinga second fluorescent emission signal by a second mean intensity, saidsecond fluorescent emission signal being measured at said first discretelocus, said second mean intensity being the mean intensity of the secondplurality of fluorescent emission signals measured at each of theplurality of discrete loci on the first microarray, thereby calculatinga second normalized fluorescent emission signal; determining a firstcolor ratio between said first normalized fluorescent emission signaland said second normalized fluorescent emission signal; normalizing athird fluorescent emission signal by a third mean intensity, said thirdfluorescent emission signal being measured at a second discrete locus ofa plurality of discrete loci on the second microarray, said third meanintensity being the mean intensity of the third plurality of fluorescentemission signals measured at each of the plurality of discrete loci onthe second microarray, thereby calculating a third normalizedfluorescent emission signal; normalizing a fourth fluorescent emissionsignal by a fourth mean intensity, said fourth fluorescent emissionsignal being measured at said second discrete locus, said fourth meanintensity being the mean intensity of the fourth plurality offluorescent emission signals measured at each of the plurality ofdiscrete loci on the second microarray, thereby calculating a fourthnormalized fluorescent emission signal; determining a second color ratiobetween said third normalized fluorescent emission signal and saidfourth normalized fluorescent emission signal, wherein the first colorratio and the second color ratio are determined for a same individualcomponent of genetic matter; wherein said first color ratio and saidsecond color ratio have opposite fluorophore bias: computing an averagecolor ratio by averaging said first color ratio and said second colorratio, wherein the first color ratio and the second color ratio are fora pair of hybridization experiments of similar, exact replica, oridentical microarrays and the same individual component of geneticmatter for both members of the pair, with fluorescent label assignmentreversed in one member of the pair, and wherein the averaging reducesfluorophore-specific bias in the average color ratio; wherein saidaverage color ratio is computed by the expression½(log(r _(x/y))−log(r _(x/y) ^((rev)))) by said instructions forcomputing, wherein r_(x/y) represents said first color ratio and r_(x/y)^((rev)) represents said second color ratio; and outputting the averagecolor ratio to a user, a user interface device, a monitor, auser-accessible computer readable storage medium, or a user-accessiblelocal or remote computer system; or displaying the average color ratio.2. The computer program product of claim 1 wherein said firstfluorophore and said second fluorophore are selected from the groupconsisting of Cy2-deoxynucleotide triphosphate, Cy3-deoxynucleotidetriphosphate, Cy3.5-deoxynucleotide triphosphate, Cy5-deoxynucleotidetriphosphate, Cy5.5-deoxynucleotide triphosphate, Cy7-deoxynucleotidetriphosphate, fluorescein, lissamine, phycoerythrin, and rhodamine. 3.The computer program product of claim 1 wherein said first microarrayand said second microarray are the same, the two color hybridizationmicroarray experiment further comprising: washing said first microarrayafter step (7).
 4. The computer program product of claim 1 wherein saidperturbed state is achieved by exposing said biological systemrepresenting said baseline state to a pharmacological agent.
 5. Thecomputer program product of claim 1 wherein said individual component ofgenetic matter comprises a messenger RNA, a complementary DNA, a genomicDNA, an oligonucleotide, or a gene fragment.
 6. A computer programproduct embodied on a computer readable storage medium, the computerreadable storage medium encoded with computer-executable instructionswhich, when executed by a processor, cause the processor to execute amethod of reducing fluorophore-specific bias in gene expression datacomprising: receiving a first color ratio and a second color ratioobtained from a two color hybridization microarray experiment comprisingsteps (1)-(12): (1) labeling a first pool of genetic matter, derivedfrom a biological system representing a baseline state, with a firstfluorophore to obtain a first pool of fluorophore-labeled geneticmatter; (2) labeling a second pool of genetic matter, derived from abiological system representing a perturbed state, with a secondfluorophore to obtain a second pool of fluorophore-labeled geneticmatter; (3) labeling a third pool of genetic matter, derived from saidbiological system representing said baseline state, with said secondfluorophore to obtain a third pool of fluorophore-labeled geneticmatter; (4) labeling a fourth pool of genetic matter, derived from saidbiological system representing said perturbed state, with said firstfluorophore to obtain a fourth pool of fluorophore-labeled geneticmatter; (5) contacting said first pool of fluorophore-labeled geneticmatter and said second pool of fluorophore-labeled genetic matter with afirst microarray under conditions such that hybridization can occur; (6)detecting a first fluorescent emission signal at a first discrete locuson the first microarray from said first pool of fluorophore-labeledgenetic matter that becomes bound to said first microarray under saidconditions; (7) detecting a second fluorescent emission signal at saidfirst discrete locus on the first microarray from said second pool offluorophore-labeled genetic matter that becomes bound to said firstmicroarray under said conditions; (8) determining said first color ratiobetween said first fluorescent emission signal and said secondfluorescent emission signal; (9) contacting said third pool offluorophore-labeled genetic matter and said fourth pool offluorophore-labeled genetic matter with a second microarray underconditions such that hybridization can occur; (10) detecting a thirdfluorescent emission signal at a second discrete locus on the secondmicroarray from said third pool of fluorophore-labeled genetic matterthat becomes bound to said second microarray under said conditions; (11)detecting a fourth fluorescent emission signal at said second discretelocus on the second microarray from said fourth pool offluorophore-labeled genetic matter that becomes bound to said secondmicroarray under said conditions; and (12) determining said second colorratio between said third fluorescent emission signal and said fourthfluorescent emission signal; wherein: said first microarray and saidsecond microarray are similar to each other, exact replicas of eachother, or are identical; said first color ratio and said second colorratio are determined for a same individual component of genetic matter;wherein said first color ratio and said second color ratio have oppositefluorophore bias; and said first fluorophore and said second fluorophoreare different; wherein the computer program mechanism further comprisescomputer executable instructions for: computing an average color ratioby averaging the first color ratio and the second color ratio by theexpression½(log(r _(x,y))−log(r _(x/y) ^((rev)))) wherein r_(x/y) represents saidfirst color ratio and r_(x/y) ^((rev)) represents said second colorratio, wherein the first color ratio and the second color ratio are fora pair of hybridization experiments of similar, exact replica, oridentical microarrays and the same individual component of geneticmatter for both members of the pair, with fluorescent label assignmentreversed in one member of the pair, and the average color ratio reducesfluorophore-specific bias as compared to the first color ratio and thesecond color ratio; and outputting the average color ratio to a user, auser interface device, a monitor, a user-accessible computer readablestorage medium, or a user-accessible local or remote computer system; ordisplaying the average color ratio.
 7. The computer program product ofclaim 6 wherein said first fluorophore and said second fluorophore areselected from the group consisting of Cy2-deoxynucleotide triphosphate,Cy3-deoxynucleotide triphosphate, Cy3.5-deoxynucleotide triphosphate,Cy5-deoxynucleotide triphosphate, Cy5.5-deoxynucleotide triphosphate,Cy7-deoxynucleotide triphosphate, fluorescein, lissamine, phycoerythrin,and rhodamine.
 8. The computer program product of claim 6 wherein saidfirst microarray and said second microarray are the same, the two colorhybridization microarray experiment further comprising: washing saidfirst microarray after said step (7).
 9. The computer program product ofclaim 6 wherein said perturbed state is achieved by exposing saidbiological system representing said baseline state to a pharmacologicalagent.
 10. The computer program product of claim 6 wherein saidindividual component of genetic matter comprises a messenger RNA, acomplementary DNA, a genomic DNA, an oligonucleotide, or a genefragment.
 11. A method of reducing fluorophore-specific bias in geneexpression data comprising: labeling a first pool of genetic matter,derived from a biological system representing a baseline state, with afirst fluorophore to obtain a first pool of first-fluorophore-labeledgenetic matter; labeling a second pool of genetic matter, derived from abiological system representing a perturbed state, with a secondfluorophore to obtain a first pool of second-fluorophore-labeled geneticmatter; labeling a third pool of genetic matter, derived from thebiological system representing the baseline state, with the secondfluorophore to obtain a second pool of second-fluorophore-labeledgenetic matter; labeling a fourth pool of genetic matter, derived fromthe biological system representing the perturbed state, with the firstfluorophore to obtain a second pool of first-fluorophore-labeled geneticmatter; contacting the first pool of first-fluorophore-labeled geneticmatter and the first pool of second-fluorophore-labeled genetic matterwith a first microarray under conditions such that hybridization canoccur; detecting a first plurality of first-fluorescent emission signalsat each of a plurality of discrete loci on the first microarray from thefirst pool of first-fluorophore-labeled genetic matter that becomesbound to the first microarray under said conditions; detecting a firstplurality of second-fluorescent emission signals at each of a pluralityof discrete loci on the first microarray from the first pool ofsecond-fluorophore-labeled genetic matter that becomes bound to thefirst microarray under said conditions; contacting the second pool ofsecond-fluorophore-labeled genetic matter and the second pool offirst-fluorophore-labeled genetic matter with a second microarray underconditions such that hybridization can occur; detecting a secondplurality of second-fluorescent emission signals at each of a pluralityof discrete loci on the second microarray from the second pool ofsecond-fluorophore-labeled genetic matter that becomes bound to thesecond microarray under said conditions; detecting a second plurality offirst-fluorescent emission signals at each of a plurality of discreteloci on the second microarray from the second pool offirst-fluorophore-labeled genetic matter that becomes bound to thesecond microarray under said conditions; normalizing a firstfirst-fluorescent emission signal by a first mean intensity, the firstfirst-fluorescent emission signal being measured at a first discretelocus of the plurality of discrete loci on the first microarray, thefirst mean intensity being a mean intensity of the first plurality offirst-fluorescent emission signals measured at each of the plurality ofdiscrete loci on the first microarray, thereby calculating a firstnormalized first-fluorescent emission signal; normalizing a firstsecond-fluorescent emission signal by a second mean intensity, the firstsecond-fluorescent emission signal being measured at the first discretelocus, the second mean intensity being a mean intensity of the firstplurality of second-fluorescent emission signals measured at each of theplurality of discrete loci on the first microarray, thereby calculatinga first normalized second-fluorescent emission signal; determining afirst color ratio between the first normalized first-fluorescentemission signal and the first normalized second-fluorescent emissionsignal; normalizing a second second-fluorescent emission signal by athird mean intensity, the second second-fluorescent emission signalbeing measured at a second discrete locus of the plurality of discreteloci on the second microarray, the third mean intensity being a meanintensity of the second plurality of second-fluorescent emission signalsmeasured at each of the plurality of discrete loci on the secondmicroarray, thereby calculating a second normalized second-fluorescentemission signal; normalizing a second first-fluorescent emission signalby a fourth mean intensity, the second first-fluorescent emission signalbeing measured at the second discrete locus, the fourth mean intensitybeing a mean intensity of the second plurality of second-fluorescentemission signals measured at each of the plurality of discrete loci onthe second microarray, thereby calculating a second normalizedfirst-fluorescent emission signal; determining a second color ratiobetween the second normalized second-fluorescent emission signal and thesecond normalized first-fluorescent emission signal; computing anaverage color ratio by averaging (i) the first color ratio between thefirst normalized first-fluorescent emission signal and the firstnormalized second-fluorescent emission signal and (ii) the second colorratio between the second normalized second-fluorescent emission signaland the second normalized first-fluorescent emission signal, wherein (i)and (ii) are for a pair of hybridization experiments of similar, exactreplica, or identical microarrays and a same individual component ofgenetic matter for both members of the pair, with label assignmentreversed in one member of the pair, and wherein the computing reducesfluorophore-specific bias for the average color ratio as compared to thefirst color ratio and the second color ratio; and outputting the averagecolor ratio to a user, a user interface device, a monitor, auser-accessible computer readable storage medium, or a user-accessiblelocal or remote computer system; or displaying the average color ratio;wherein: the first microarray and the second microarray are similar toeach other, exact replicas of each other, or are identical; the firstcolor ratio and the second color ratio are determined for a sameindividual component of genetic matter; the first color ratio and thesecond color ratio have opposite fluorophore bias; the first fluorophoreand the second fluorophore are different; and the average color ratio iscomputed via:½(log(r _(x/y))−log(r _(x/y) ^((rev)))) wherein r_(x/y) represents thefirst color ratio and r_(x/y) ^((rev)) represents the second colorratio, and wherein said normalizing, determining, computing, andoutputting steps are performed on a suitably programmed computer.